Abstract

Partial-monitoring games constitute a mathematical framework for sequential decision making problems 
with imperfect feedback: The learner repeatedly chooses an action, the opponent responds with an outcome, 
and then the learner suffers a loss and receives a feedback signal, both of which are fixed functions of 
the action and the outcome. The goal of the learner is to minimize his total cumulative loss. We make 
progress towards the classification of these games based on their minimax expected regret. Namely, we 
classify almost all games with two outcomes and a finite number of actions: We show that their minimax
expected regret is either zero, $\tilde{\Theta}(\sqrt{T})$, $\Theta(T^{2/3})$, or $\Theta(T)$, and we give a simple and efficiently computable
classification of these four classes of games. Our hope is that the result can serve as a stepping stone toward 
classifying all finite partial-monitoring games. 

Keywords: Online algorithms. Online learning. Imperfect feedback. Regret analysis 



1. Introduction 

Partial-monitoring games constitute a mathematical framework for sequential decision making problems
with imperfect feedback. They arise as a natural generalization of many sequential decision making
problems with full or partial feedback, such as learning with expert advice [2, 3, 4], the multi-armed bandit
problem [5, 6, 7], label efficient prediction [8, 9], dynamic pricing [10, 11], the dark pool problem [12], the
apple tasting problem [13], online convex optimization [14, 15], and online linear [16] and convex optimization
with bandit feedback [17].

A partial-monitoring game is a repeated game between two players: the learner and the opponent. In 
each round, the learner chooses an action and simultaneously the opponent chooses an outcome. Next, the 
learner receives a feedback signal and suffers a loss; however, neither the loss nor the outcome is revealed to
the learner. The feedback and the loss are fixed functions of the action and the outcome, and these functions
are known by both players. The main feature of this model is that it captures that the learner has imperfect 
or partial information about the outcome sequence. In this work, we make the natural assumption that the 
opponent is oblivious, that is, the opponent does not have access to the learner's actions. 

The goal of the learner is to keep his cumulative loss small. However, since the opponent could choose 
the outcome sequence so that the learner suffers as high loss as possible, it is too much to ask for an absolute 
guarantee for the cumulative loss. Instead, a competitive viewpoint is taken and the cumulative loss of the 
learner is compared with the cumulative loss of the best among all the constant strategies, i.e., strategies 
that choose the same action in every round. The difference between the cumulative loss of the learner and 
the cumulative loss of the best constant strategy is called the regret. 

Generally, the regret grows with the number of rounds of the game. If the growth is sublinear, then the
learner is said to be Hannan consistent,^1 and in the long run the learner's average loss per round approaches
the average loss per round of the best action.

Designing learning algorithms with low regret is the main focus of study of partial-monitoring games. 
For a given game, the ultimate goal is to find out its optimal worst-case (minimax) regret, and design an 
algorithm that achieves it. The minimax regret can be viewed as an inherent measure of how hard the game 
is for the learner. The motivation behind this paper was the desire to determine the minimax regret and 
design an algorithm achieving it for each game in a large class. 

In this paper we restrict our attention to games with a finite number of actions and two outcomes. This
class is a subset of the class of finite partial-monitoring games, introduced by Piccolboni and Schindelhauer
[19], in which both the set of actions and the set of outcomes are finite.



1.1. Previous Results 

For full-information games (i.e., when the feedback determines the outcome) with $N$ actions and losses
lying in the interval $[0, 1]$, there exists a randomized algorithm with expected regret at most $\sqrt{T \ln(N)/2}$,
where $T$ is the time horizon (see, e.g., Lugosi and Cesa-Bianchi [20, Chapter 4] and references therein).
Furthermore, it is known that this upper bound is tight: There exist full-information games with losses lying
in the interval $[0, 1]$ for which the worst-case expected regret of any algorithm is at least $\Omega(\sqrt{T \ln N})$ [20,
Chapter 3].
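To make the full-information upper bound concrete, the following is a minimal Python sketch of the exponentially weighted average forecaster with learning rate $\eta = \sqrt{8 \ln N / T}$, the standard tuning behind the $\sqrt{T \ln(N)/2}$ bound. The function name `ewa_expected_regret` and the toy loss sequence are illustrative assumptions and do not come from the paper.

```python
import math

def ewa_expected_regret(losses):
    """Expected regret of the exponentially weighted average forecaster.

    losses[t][i] is the loss in [0, 1] of action i in round t; all of them
    are revealed after each round (full information).
    """
    T, N = len(losses), len(losses[0])
    eta = math.sqrt(8 * math.log(N) / T)      # standard tuning for a known horizon
    cum = [0.0] * N                           # cumulative loss of each action so far
    expected_loss = 0.0
    for round_losses in losses:
        m = min(cum)                          # shift for numerical stability
        weights = [math.exp(-eta * (c - m)) for c in cum]
        total = sum(weights)
        probs = [w / total for w in weights]
        expected_loss += sum(p * l for p, l in zip(probs, round_losses))
        cum = [c + l for c, l in zip(cum, round_losses)]
    return expected_loss - min(cum)           # compare with the best constant action

# Hypothetical deterministic loss sequence with values in [0, 1].
T, N = 1000, 3
losses = [[((t * (i + 1)) % 7) / 6 for i in range(N)] for t in range(T)]
regret = ewa_expected_regret(losses)
bound = math.sqrt(T * math.log(N) / 2)
```

For any loss sequence in $[0, 1]$, `regret` should not exceed `bound`.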

Another special case of partial-monitoring games is the multi-armed bandit game, where the learner's
feedback is the loss of the action he chooses. For a multi-armed bandit game with $N$ actions and losses lying
in the interval $[0, 1]$, the INF algorithm [21] has expected regret at most $O(\sqrt{TN})$. (The well-known Exp3
algorithm [5] achieves the bound $O(\sqrt{TN \log N})$.) It is also known that the bound $O(\sqrt{TN})$ is optimal [5].

Piccolboni and Schindelhauer [19] introduced finite partial-monitoring games. They showed that, for
any finite game, either there is a strategy for the learner that achieves regret of at most $O(T^{3/4}(\ln T)^{1/2})$ or
the worst-case expected regret of any learner is $\Omega(T)$. Cesa-Bianchi et al. [22] improved this result and
showed that Piccolboni and Schindelhauer's algorithm achieves $O(T^{2/3})$ regret. They also gave an example
of a game with worst-case expected regret at least $\Omega(T^{2/3})$. More recently, Lugosi et al. [23] designed
algorithms and proved upper bounds in a slightly different setting, where the feedback signal is a possibly
noisy function of the outcome or both the action and the outcome.

However, from these results it is unclear what determines which games have minimax regret $\tilde{\Theta}(\sqrt{T})$,
which games have minimax regret $\Theta(T^{2/3})$, and whether there exist finite games with minimax regret not
belonging to either of these categories. Cesa-Bianchi et al. [22] note that: "It remains a challenging problem
to characterize the class of problems that admit rates of convergence faster than $O(n^{-1/3})$."^2

^1 Hannan consistency is named after James Hannan, who was the first to design a learning algorithm with sublinear regret for
finite games with full feedback [18].

1.2. Our Results 

We classify the minimax expected regret of finite partial-monitoring games with two outcomes. From 
our classification we exclude certain "degenerate games"; their precise definition is given later in the paper. 
We show that the minimax regret of any non-degenerate game falls into one of the four categories: 0,
$\tilde{\Theta}(\sqrt{T})$, $\Theta(T^{2/3})$, $\Theta(T)$, and no other option is possible.^3 We call the four classes of games trivial, easy,
hard, and hopeless, respectively. We give a simple and efficiently computable geometric characterization of 
these four classes. 

Additionally, we show that each of the four classes admits a computationally efficient learning algorithm 
achieving the minimax expected regret, up to logarithmic factors. In particular, we design an efficient 
learning algorithm for easy games with expected regret at most $\tilde{O}(\sqrt{T})$. For hard games, the algorithm of
Cesa-Bianchi et al. [22] has $O(T^{2/3})$ regret. For trivial games, a simple algorithm that chooses the same
action in every round has zero regret. For hopeless games, any algorithm has $\Theta(T)$ regret.

2. Basic Definitions and Notations 

A finite partial-monitoring game is specified by a pair of $N \times M$ matrices $(L, H)$, where $N$ is the number
of actions, $M$ is the number of outcomes, $L$ is the loss matrix, and $H$ is the feedback matrix. We use the
notation $\underline{n} = \{1, \dots, n\}$ for any integer $n$ and denote the actions and outcomes by integers starting from 1, so
the action set is $\underline{N}$ and the outcome set is $\underline{M}$. We denote by $\ell_{i,j}$ and $h_{i,j}$ ($i \in \underline{N}$, $j \in \underline{M}$) the entries of $L$ and
$H$, respectively. We denote by $\ell_i$ the $i$-th row ($i \in \underline{N}$) of $L$, and we call it the loss vector of action $i$. The
elements of $L$ are arbitrary real numbers. The elements of $H$ belong to some alphabet $\Sigma$; we only assume
that the learner is able to distinguish two different elements of the alphabet. We often use the set of natural
or real numbers as the alphabet.

The matrices $L$, $H$ are known by both the learner and the opponent. The game proceeds in $T$ rounds. In
each round $t = 1, 2, \dots, T$, the learner chooses an action $I_t \in \underline{N}$ and simultaneously the opponent chooses
an outcome $J_t \in \underline{M}$; then the learner receives the feedback $h_{I_t,J_t}$. Nothing else is revealed to the learner; in
particular, $J_t$ and the loss $\ell_{I_t,J_t}$ remain hidden.
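The protocol just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `play`, the 0-based indexing, and the toy learner are hypothetical; the loss matrix follows the prose description of the apple tasting game given later in the paper, and the feedback symbols are arbitrary.

```python
def play(L, H, learner, outcomes):
    """Run the game; the learner is handed only the past feedback symbols."""
    feedback_history, actions, total_loss = [], [], 0
    for j in outcomes:                      # outcome J_t, fixed in advance (oblivious opponent)
        i = learner(feedback_history)       # action I_t depends on past feedback only
        feedback_history.append(H[i][j])    # feedback h_{I_t, J_t} is revealed
        total_loss += L[i][j]               # loss l_{I_t, J_t} is accumulated but stays hidden
        actions.append(i)
    return actions, total_loss

# Apple tasting with 0-based indices: action 0 = sell, action 1 = taste;
# outcome 0 = rotten, outcome 1 = healthy.
L = [[1, 0], [0, 1]]
H = [["a", "a"], ["b", "c"]]

# Hypothetical learner: taste in the first round, sell afterwards.
actions, total_loss = play(L, H, lambda history: 1 if not history else 0, [1, 0, 1])
```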

In principle, both $I_t$ and $J_t$ can be chosen randomly. However, to simplify our treatment, we assume
that the opponent is deterministic and oblivious to the actions of the learner. Equivalently, we can assume
that the sequence of outcomes $J_1, J_2, \dots, J_T$ is a fixed deterministic sequence chosen before the first round
of the game. On the other hand, it is important to allow the learner to choose his actions $I_t$ randomly. A
randomized strategy (algorithm) $A$ of the learner is a sequence of random functions $I_1, I_2, \dots, I_T$, where
each of the functions maps the feedback from the past outcomes (and the learner's internal random "bits") to
an action; formally, $I_t : \Sigma^{t-1} \times \Omega \to \underline{N}$.

The learner is scored according to the loss matrix. In each round $t$, the learner incurs instantaneous loss
$\ell_{I_t,J_t}$. The goal of the learner is to keep his cumulative loss $\sum_{t=1}^{T} \ell_{I_t,J_t}$ small. The (cumulative) regret of an
algorithm $A$ is defined as

$$R_T = R_T(A, G) = \sum_{t=1}^{T} \ell_{I_t,J_t} - \min_{i \in \underline{N}} \sum_{t=1}^{T} \ell_{i,J_t} .$$



^2 They used $n$ instead of $T$, and by rate they mean the average regret per time step.
^3 The notations $\tilde{\Theta}$ and $\tilde{O}$ hide poly-logarithmic factors in $T$.
In other words, the regret is the excess loss of the learner compared to the loss of the best constant action.
We denote by $\bar{R}_T = \bar{R}_T(A, G) = \mathbb{E}[R_T(A, G)]$ the (cumulative) expected regret. The worst-case expected
regret of $A$ when used in $G = (L, H)$ is

$$\sup_{J_{1:T} \in \underline{M}^T} \bar{R}_T(A, G),$$

where the supremum is taken over all outcome sequences $J_{1:T} = (J_1, J_2, \dots, J_T) \in \underline{M}^T$. The minimax
expected regret of $G$ (or minimax regret, for short) is

$$R_T(G) = \inf_{A} \sup_{J_{1:T} \in \underline{M}^T} \bar{R}_T(A, G),$$

where the infimum is taken over all randomized strategies $A$. Note that, since $\bar{R}_T(A, G) \ge 0$ for constant
outcome sequences, $R_T(G) \ge 0$ also holds.
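As a small illustration of the regret definition, the following Python sketch computes the realized regret of a given action sequence against the best constant action. The function name `regret` and the example sequences are assumptions made for this sketch; indices are 0-based.

```python
def regret(L, actions, outcomes):
    """Cumulative regret: learner's loss minus the loss of the best constant action."""
    learner_loss = sum(L[i][j] for i, j in zip(actions, outcomes))
    best_constant = min(sum(row[j] for j in outcomes) for row in L)
    return learner_loss - best_constant

L = [[1, 0], [0, 1]]                              # two actions, two outcomes
r_mixed = regret(L, [1, 0, 0], [1, 0, 1])         # learner loses 2, best constant action loses 1
r_constant = regret(L, [0, 1, 0], [0, 0, 0])      # constant outcome sequence
```

On a constant outcome sequence the best constant action is optimal in every round, so the regret of any action sequence is nonnegative, which is the observation used in the text to conclude that the minimax regret is nonnegative.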

We identify the set of all probability distributions over the set of outcomes $\underline{M}$ with the probability
simplex $\Delta_M = \{p \in \mathbb{R}^M : \sum_{j=1}^{M} p(j) = 1,\ \forall j \in \underline{M},\ p(j) \ge 0\}$. We use $\langle \cdot, \cdot \rangle$ to denote the standard dot
product.

3. Characterization of Games with Two Outcomes

In this section, we formally phrase our main characterization result. We need a preliminary definition 
that is useful for any finite game: 

Definition 1 (Properties of Actions). Let $G = (L, H)$ be a finite partial-monitoring game with $N$ actions
and $M$ outcomes. Let $i \in \underline{N}$ be one of its actions.

• Action $i$ is called dominated if for any $p \in \Delta_M$ there exists an action $i'$ such that $\ell_{i'} \ne \ell_i$ and $\langle \ell_{i'}, p \rangle \le
\langle \ell_i, p \rangle$.

• Action $i$ is called non-dominated if it is not dominated.

• Action $i$ is called degenerate if it is dominated and there exists a distribution $p \in \Delta_M$ such that for all
$i' \in \underline{N}$, $\langle \ell_i, p \rangle \le \langle \ell_{i'}, p \rangle$.

• Action $i$ is called all-revealing if any pair of outcomes $j, j'$, $j \ne j'$, satisfies $h_{i,j} \ne h_{i,j'}$.

• Action $i$ is called none-revealing if any pair of outcomes $j, j'$ satisfies $h_{i,j} = h_{i,j'}$.

• Action $i$ is called partially-revealing if it is neither all-revealing nor none-revealing.

• All-revealing and partially-revealing actions together are called revealing actions.

• Two or more actions with the same loss vector are called duplicate actions.

The property of being dominated has an equivalent dual definition. Namely, action $i$ is dominated if there
exists a set of actions with loss vectors not equal to $\ell_i$ such that some convex combination of their loss
vectors is componentwise upper bounded by $\ell_i$.
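For two outcomes, the properties in Definition 1 can be checked exactly, because every distribution has the form $p = (1 - \rho, \rho)$ with $\rho \in [0, 1]$. The following Python sketch is one possible way to do this and is not taken from the paper: the helper names (`best_margin`, `is_dominated`, `is_degenerate`, `is_revealing`) and the example matrices are assumptions. It relies on the fact that the minimum of finitely many linear functions of $\rho$ is concave, so its maximum over $[0, 1]$ is attained at an endpoint or at a crossing point.

```python
from fractions import Fraction

def best_margin(L, i):
    """max over rho in [0,1] of min_k [f_k(rho) - f_i(rho)], where
    f_k(rho) = (1 - rho) * L[k][0] + rho * L[k][1] and k ranges over the
    actions whose loss vector differs from that of i (None if there are none)."""
    li = [Fraction(x) for x in L[i]]
    lines = []                                   # each gap f_k - f_i equals a + b * rho
    for row in L:
        lk = [Fraction(x) for x in row]
        if lk != li:
            lines.append((lk[0] - li[0], (lk[1] - lk[0]) - (li[1] - li[0])))
    if not lines:
        return None
    candidates = {Fraction(0), Fraction(1)}      # endpoints plus pairwise crossings
    for a1, b1 in lines:
        for a2, b2 in lines:
            if b1 != b2:
                rho = (a2 - a1) / (b1 - b2)
                if 0 <= rho <= 1:
                    candidates.add(rho)
    return max(min(a + b * rho for a, b in lines) for rho in candidates)

def is_dominated(L, i):
    # Dominated: for every rho some action with a different loss vector is at least as good.
    m = best_margin(L, i)
    return m is not None and m <= 0

def is_degenerate(L, i):
    # Degenerate: dominated, yet (tied for) optimal under some distribution.
    m = best_margin(L, i)
    return m is not None and m == 0

def is_revealing(H, i):
    # With two outcomes, an action is revealing iff its two feedback symbols differ.
    return H[i][0] != H[i][1]

# Illustrative loss matrices (assumptions of this sketch).
L_mid = [(2, 0), (1, 1), (0, 2)]    # middle row is the midpoint of the other two
L_far = [(1, 1), (0, 1), (1, 0)]    # first row is strictly worse than a mixture of the others
```

With these inputs, the middle action of `L_mid` is dominated and degenerate, while the first action of `L_far` is dominated but not degenerate.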

In games with $M = 2$ outcomes, each action is either all-revealing or none-revealing. This dichotomy is
one of the key properties that lead to the classification theorem for two-outcome games. To emphasize the
dichotomy, from now on we will refer to them as revealing and non-revealing whenever it is clear from the
context that $M = 2$.

Figure 1: The figure shows each action $i$ as a point in $\mathbb{R}^2$ with coordinates $(\ell_{i,1}, \ell_{i,2})$. The solid line connects the chain of
non-dominated actions, which, by convention, are ordered according to their loss for the first outcome. Legend: □ revealing
non-dominated action; ■ non-revealing non-dominated action; ○ dominated action (revealing or non-revealing).

The above property also allows us to assume without loss of generality that there are no duplicate
actions. Clearly, if multiple actions with the same loss vector exist, all but one can be removed (together
with the corresponding rows of $L$ and $H$) without changing the minimax regret: If all of them are non-revealing,
we keep one of the actions and remove all the others. Otherwise, we keep a revealing action
and remove the others. If we then replace any algorithm by one that plays the corresponding kept action
whenever the original algorithm would play a removed action, the loss does not increase: it equals the loss
of the original algorithm in the original game. So the two games have the same minimax regret.

The concepts of dominated and non-dominated actions can be visualized for two-outcome games by
drawing the loss vector of each action as a point in $\mathbb{R}^2$. The points corresponding to the non-dominated
actions lie on the bottom-left boundary of the convex hull of the set of all the actions, as shown in Figure 1.
Enumerating the non-dominated actions ordered according to their loss for the first outcome gives rise to a
sequence $(i_1, i_2, \dots, i_K)$, which we call the chain of non-dominated actions.

To state the classification theorem, we introduce the following conditions. 

Separation Condition. A two-outcome game $G$ satisfies the separation condition if, after removing duplicate
actions, its chain of non-dominated actions does not have a pair of consecutive actions $i_j, i_{j+1}$ such
that both of them are non-revealing. The set of games satisfying this condition will be denoted by $S$.

Non-degeneracy Condition. A two-outcome game $G$ is degenerate if it has a degenerate revealing action.
If $G$ is not degenerate, we call it non-degenerate and we say that it satisfies the non-degeneracy condition.

As we will soon see, the separation condition is the key to distinguish between hard and easy games. 
On the other hand, the non-degeneracy condition is merely a technical condition that we need in our proofs. 
The set of degenerate games is excluded from the characterization, as we do not know the minimax regret 
of these games. We are now ready to state our main result. 
Theorem 2 (Classification of Two-Outcome Partial-Monitoring Games). Let $S$ be the set of all finite partial-monitoring
games with two outcomes that satisfy the separation condition. Let $G = (L, H)$ be a game with
two outcomes that satisfies the non-degeneracy condition. Let $K$ be the number of non-dominated actions
in $G$, counting duplicate actions only once. The minimax expected regret $R_T(G)$ satisfies

$$
R_T(G) = \left\{
\begin{array}{lll}
0, & K = 1; & \text{(1a)} \\
\tilde{\Theta}(\sqrt{T}), & K \ge 2,\ G \in S; & \text{(1b)} \\
\Theta(T^{2/3}), & K \ge 2,\ G \notin S,\ G \text{ has a revealing action}; & \text{(1c)} \\
\Theta(T), & \text{otherwise.} & \text{(1d)}
\end{array}
\right.
$$


We call the games in cases (1a)-(1d) trivial, easy, hard, and hopeless, respectively. Case (1a) is proven
by the following lemma, which shows that a trivial game is also characterized by having minimax regret 0
in a single round or by having an action that alone "dominates" all the others:

Lemma 3. For any finite partial-monitoring game, the following four statements are equivalent: 

a) The minimax regret is zero for each T. 

b) The minimax regret is zero for some T. 

c) There exists a (non-dominated) action $i \in \underline{N}$ whose loss is not larger than the loss of any other action,
irrespective of the choice of Nature's action.

d) The game is trivial, i.e., $K = 1$ (using the definition in Theorem 2).

The proof of this lemma can be found in the Appendix. Case (1d) of Theorem 2 is proven in the
Appendix as well. The upper bound of case (1c) can be derived from a result of Cesa-Bianchi et al. [22].
Recall that the entries of $H$ can be changed without changing the information revealed to the learner as long
as one does not change the pattern of which elements in a row are equal and different. Cesa-Bianchi et al.
[22] show that if the entries of $H$ can be chosen such that $\operatorname{rank}(H) = \operatorname{rank}\begin{pmatrix} H \\ L \end{pmatrix}$, then $O(T^{2/3})$ expected
regret is achievable. This condition holds trivially for two-outcome games with at least one revealing action
and $N \ge 2$. It remains to prove the upper bound for case (1b), the lower bound for (1b), and the lower bound
for (1c); we prove these in Sections 5, 6, and 7, respectively.



4. Examples 

Before we dive into the proof of Theorem 2, we give a few examples of finite partial-monitoring games
with two outcomes and show how the theorem can be applied. For each example we present the matrices
$L$, $H$ and depict the loss vectors of actions as points in $\mathbb{R}^2$.

Example 4 (One-Armed Bandit). We start with an example of a multi-armed bandit game. Multi-armed
bandit games are those where the feedback equals the instantaneous loss, that is, when $L = H$.^4



^4 "Classically", non-stochastic multi-armed bandit problems are defined by the restriction that in no round can Learner gain any
information about the losses of actions other than the chosen one, that is, $L$ is not known in advance to Learner. (Also, the domain
set of losses is often infinite there ($M = \infty$).) When $H = L$ in our setting, depending on $L$, this might or might not be the case;
the "classical bandit" problem with losses constrained to a finite set is a special case of games with $H = L$. However, the latter
condition also allows other types of games where Learner can recover the losses of actions not chosen, and which could therefore be
"easier" than classical bandits due to the knowledge of $L$. Nevertheless, it is easy to see that these games are at most as hard as
classical bandit games.
[Figure: the loss vectors of the two actions drawn as points in $\mathbb{R}^2$. Legend: □ revealing non-dominated action;
■ non-revealing non-dominated action.]

Because the loss of the first action is 0 regardless of the outcome, and the loss varies only for the second
action, we call this game a one-armed bandit game. Both actions are non-dominated and the second one is
revealing; therefore it is an easy game and, according to Theorem 2, its minimax regret is $\tilde{\Theta}(\sqrt{T})$. (For this
specific game, it can be shown that it is in fact $\Theta(\sqrt{T})$.)

Example 5 (Apple Tasting). Consider an orchard that wants to hand out its crop of apples for sale. How- 
ever, some of the apples might be rotten. The orchard can do a sequential test. Each apple can be either 
tasted (which reveals whether the apple is healthy or rotten) or the apple can be given out for sale. If a 
rotten apple is given out for sale, the orchard suffers a unit loss. On the other hand, if a healthy apple is 
tasted, it cannot be sold and, again, the orchard suffers a unit loss. This can be formalized by the following 
partial-monitoring game [13]:

$$L = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad H = \begin{pmatrix} a & a \\ b & c \end{pmatrix}.$$

[Figure: the loss vectors of the two actions drawn as points in $\mathbb{R}^2$. Legend: □ revealing non-dominated action;
■ non-revealing non-dominated action.]

The first action corresponds to giving out the apple for sale, the second corresponds to tasting the apple;
the first outcome corresponds to a rotten apple, the second outcome corresponds to a healthy apple. Both
actions are non-dominated and the second one is revealing; therefore it is an easy game and, according to
Theorem 2, the minimax regret is $\tilde{\Theta}(\sqrt{T})$. This is apparently a new result for this game. Also notice that the
picture is just a translation of the picture for the one-armed bandit.

Example 6 (Label Efficient Prediction). Consider a situation when we would like to sequentially classify 
emails as spam or as legitimate. For each email we have to output a prediction, and additionally we can 
request, as feedback, the correct label from the user. If we classify an email incorrectly or we request its 
label, we suffer a unit loss. (If the email is classified correctly and we do not request the feedback, no loss 
is suffered.) This can be formalized by the following partial-monitoring game [22]:

$$L = \begin{pmatrix} 1 & 1 \\ 0 & 1 \\ 1 & 0 \end{pmatrix}, \qquad H = \begin{pmatrix} a & b \\ c & c \\ d & d \end{pmatrix},$$

[Figure: the loss vectors of the three actions drawn as points in $\mathbb{R}^2$. Legend: ■ non-revealing non-dominated action;
○ revealing dominated action.]

where the first action corresponds to a label request, and the second and the third action correspond to a
prediction (spam and legitimate, respectively) without a request. The outcomes correspond to spam and
legitimate emails.

We see that the chain of non-dominated actions contains two neighboring non-revealing actions and
there is a dominated revealing action. Therefore, it is a hard game and, by Theorem 2, the minimax regret is
$\Theta(T^{2/3})$. This specific example was the only game known so far with minimax regret at least $\Omega(T^{2/3})$ [22,
Theorem 5.1].

Example 7 (A Hopeless Game). The following game is an example where the feedback does not reveal 
any information about the outcome:

[Matrices $L$ and $H$ and the accompanying figure; both actions are shown as non-revealing non-dominated
actions (■).]

Because both actions are non-revealing and non-dominated, it is a hopeless game and thus its minimax
regret is $\Theta(T)$.

Example 8 (A Trivial Game). In the following game, the best action, regardless of the outcome sequence, 
is action 2. A learner that chooses this action in every round is guaranteed to have zero regret.

[Matrices $L$ and $H$ and the accompanying figure. Legend: □ revealing non-dominated action; ○ revealing
dominated action.]

Because this game has only one non-dominated action (action 2), it is a trivial game and thus its minimax
regret is 0.

Example 9 (A Degenerate Game). The next game does not satisfy the non-degeneracy condition and therefore
Theorem 2 does not apply.

$$L = \begin{pmatrix} 2 & 0 \\ 1 & 1 \\ 0 & 2 \end{pmatrix}, \qquad H = \begin{pmatrix} a & a \\ b & c \\ d & d \end{pmatrix}.$$

[Figure: the loss vectors of the three actions drawn as points in $\mathbb{R}^2$. Legend: ■ non-revealing non-dominated action;
○ revealing dominated (degenerate) action.]

Its minimax regret is between $\Omega(\sqrt{T})$ and $O(T^{2/3})$. It remains an open problem to close this gap and
determine the exact rate of growth.

5. Upper bound for easy games 

In this section we present our algorithm for games satisfying the separation condition and the non-degeneracy
condition, and prove that it achieves $\tilde{O}(\sqrt{T})$ regret with high probability. We call the algorithm
AppleTree since it builds a binary tree whose leaves are apple tasting games.

5.1. The algorithm 

In the first step of the algorithm we can purify the game by first removing the dominated actions and 
then the duplicates as mentioned beforehand. 

The idea of the algorithm is to recursively split the game until we arrive at games with two actions only. 
Now, if one has only two actions in a partial-information game, the game must be either a full-information 
game (if both actions are revealing) or an instance of a one-armed bandit (with one revealing and one 
non-revealing action). 

To see why this latter case corresponds to one-armed bandits, assume without loss of generality that
the first action is the revealing action. Now, it is easy to see that the regret of a sequence of actions in a
game does not change if the loss matrix is changed by subtracting the same number from a column.^5 By
subtracting $\ell_{2,1}$ from the first and $\ell_{2,2}$ from the second column we thus get the equivalent game where the
second row of the loss matrix is zero, arriving at a one-armed bandit game (see Example 4). Since a one-armed
bandit is a special form of a two-armed bandit, one can use Exp3.P due to Auer et al. [5] to achieve
the $O(\sqrt{T})$ regret.
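The invariance of the regret under column shifts is easy to check numerically. The following Python sketch uses a hypothetical two-action loss matrix and hypothetical action and outcome sequences (0-based indices); the function names are assumptions of this sketch.

```python
def regret(L, actions, outcomes):
    """Learner's cumulative loss minus the loss of the best constant action."""
    learner_loss = sum(L[i][j] for i, j in zip(actions, outcomes))
    return learner_loss - min(sum(row[j] for j in outcomes) for row in L)

def shift_columns(L, c):
    """Subtract the constant c[j] from every entry of column j."""
    return [[row[j] - c[j] for j in range(len(c))] for row in L]

L = [[3, 1], [2, 5]]                    # hypothetical losses; action 0 plays the revealing action
L_shifted = shift_columns(L, L[1])      # subtract the second row: it becomes (0, 0)

actions, outcomes = [0, 1, 0, 0], [0, 1, 1, 0]
r_original = regret(L, actions, outcomes)
r_shifted = regret(L_shifted, actions, outcomes)
```

Both regrets coincide because the subtracted constants appear once in the learner's loss and once in the loss of every constant action, so they cancel in the difference.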

Now, if there are more than two actions in the game, then the game is split, putting the first half of the
actions into the first and the second half into the second subgame, with a single common shared action.
Recall that, in the chain of non-dominated actions, the actions are ordered according to their losses corresponding
to the first outcome. This is continued until the split results in games with two actions only. The
recursive splitting of the game results in a binary tree (see Figure 2). The idea of the strategy played at
an internal node of the tree is as follows: An outcome sequence of length $T$ determines the frequency $\rho_T$
of outcome 2. If this frequency is small, the optimal action is one of the actions of $G_1$, the first subgame
(simply because then the frequency of outcome 1 is high and $G_1$ contains the actions with the smallest loss
for the first outcome). Conversely, if this frequency is large, the optimal action is one of the actions of $G_2$. In
some intermediate range, the optimal action is the action shared between the subgames. Let the boundaries
of this range be $\rho_1^* < \rho_2^*$ ($\rho_1^*$ is thus the solution to $(1 - \rho)\ell_{s-1,1} + \rho\,\ell_{s-1,2} = (1 - \rho)\ell_{s,1} + \rho\,\ell_{s,2}$ and $\rho_2^*$ is the
solution to $(1 - \rho)\ell_{s+1,1} + \rho\,\ell_{s+1,2} = (1 - \rho)\ell_{s,1} + \rho\,\ell_{s,2}$, where $s = \lceil K/2 \rceil$ is the index of the action shared
between the two subgames.)
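The two boundaries are the points where the expected loss of the shared action crosses that of its two neighbors in the chain. A minimal Python sketch follows, assuming a chain of at least three non-dominated actions already ordered by loss for the first outcome; the function names and the example chain are hypothetical.

```python
from fractions import Fraction
import math

def crossing(la, lb):
    """Frequency rho of outcome 2 at which loss vectors la and lb have equal expected loss.

    Solves (1 - rho) * la[0] + rho * la[1] = (1 - rho) * lb[0] + rho * lb[1].
    """
    d1 = Fraction(la[0]) - Fraction(lb[0])
    d2 = Fraction(la[1]) - Fraction(lb[1])
    return d1 / (d1 - d2)

def boundaries(chain):
    """Return (rho_1*, rho_2*) for a chain of K >= 3 non-dominated loss vectors."""
    K = len(chain)
    s = math.ceil(K / 2)                 # 1-based index of the shared action
    shared = chain[s - 1]
    return crossing(chain[s - 2], shared), crossing(shared, chain[s])

# Hypothetical chain, ordered by loss for the first outcome.
chain = [(0, 3), (1, 1), (3, 0)]
rho1, rho2 = boundaries(chain)
```

For this chain the shared action is the middle one, and it is optimal exactly when the frequency of outcome 2 lies between `rho1` and `rho2`.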

If we knew $\rho_T$, a good solution would be to play a strategy where the actions are restricted to those of
either game $G_1$ or $G_2$, depending on whether $\rho_T \le \rho_1^*$ or $\rho_T \ge \rho_2^*$. (When $\rho_1^* \le \rho_T \le \rho_2^*$, it does
not matter which action set we restrict the play to, since the optimal action in this case is included in both
sets.) There are two difficulties. First, since the outcome sequence is not known in advance, the best we
can hope for is to know the running frequencies $\rho_t = \frac{1}{t} \sum_{s=1}^{t} \mathbb{I}(J_s = 2)$. However, since the game is a
partial-information game, the outcomes are not revealed in all time steps; hence, even $\rho_t$ is inaccessible.



^5 As a result, for any algorithm, if $R_T$ is its regret at time $T$ when measured in the game with the modified loss matrix, the
algorithm's "true" regret will also be $R_T$ (i.e., the algorithm's regret when measured in the original, unmodified game). Piccolboni
and Schindelhauer [19] exploit this idea, too.
Figure 2: The binary tree built by the algorithm. The leaf nodes represent neighboring action pairs.



Nevertheless, for now let us assume that $\rho_t$ was available. Then one idea would be to play a strategy
restricted to the actions of either game $G_1$ or $G_2$ as long as $\rho_t$ stays below $\rho_1^*$ or above $\rho_2^*$. Further, when $\rho_t$
becomes larger than $\rho_2^*$ while previously the strategy played the actions of $G_1$, then we have to switch to the
game $G_2$. In this case, we start a fresh copy (a reset) of a strategy playing in $G_2$. The same happens when a
switch from $G_2$ to game $G_1$ is necessary. These resets are necessary because at the leaves we play according
to strategies that use weights that depend exponentially on the cumulative losses of the actions. To see an
example where the algorithm fails to achieve a small regret without resets, consider the case when there are
3 actions, the middle one being revealing. Assume that during the first $T/2$ time steps the frequency of
outcome 2 oscillates between the two boundaries so that the algorithm switches constantly back and forth
between the games $G_1$ and $G_2$. Assume further that in the second half of the game, the outcome is always
2. This way the optimal action will be 3. Nevertheless, up to time step $T/2$, the player of $G_2$ will only see
outcome 1 and thus will think that action 2 is the optimal action. In the second half of the game, he will not
have enough time to recover and will play action 2 for too long. Resetting the algorithms of the subgames
avoids this behavior.

If the number of switches was large, the repeated resetting of the strategies could be equally problematic.
Luckily this cannot happen, hence the resetting does minimal harm. We will in fact show that this
generalizes to the case even when $\rho_t$ is estimated based on partial feedback (see Lemma 11).



Let us now turn to how $\rho_t$ is estimated. As mentioned in Section 3, mapping a row of $H$ bijectively
leads to an equivalent game; thus for $M = 2$ we can assume without loss of generality that in any round,
the algorithm receives (possibly random) feedback $H_t \in \{1, 2, *\}$: if a revealing action is played in the
round, $H_t = J_t \in \{1, 2\}$, otherwise $H_t = *$. Let $\mathcal{H}_{1:t-1} = (I_1, H_1, \dots, I_{t-1}, H_{t-1}) \in (\underline{N} \times \Sigma)^{t-1}$ be the (random)
history of actions and observations up to time step $t - 1$. If the algorithm choosing the actions decides with
probability $p_t \in (0, 1]$ to play a revealing action ($p_t$ can depend on $\mathcal{H}_{1:t-1}$), then $\mathbb{I}(H_t = 2)/p_t$ is a simple
unbiased estimate of $\mathbb{I}(J_t = 2)$ (in fact, $\mathbb{E}[\mathbb{I}(H_t = 2)/p_t \mid \mathcal{H}_{1:t-1}] = \mathbb{I}(J_t = 2)$). As long as $p_t$ does not drop
to too low a value, $\hat{\rho}_t = \frac{1}{t} \sum_{s=1}^{t} \frac{\mathbb{I}(H_s = 2)}{p_s}$ will be a relatively reliable estimate of $\rho_t$ (see Lemma 12). However
reliable this estimate is, it can still differ from $\rho_t$. For this reason, we push the boundaries determining game
switches towards each other:

$$\rho_1' = \frac{2\rho_1^* + \rho_2^*}{3}, \qquad \rho_2' = \frac{\rho_1^* + 2\rho_2^*}{3}. \qquad (2)$$

We call the resulting algorithm AppleTree, because the elementary partial-information 2-action games
in the bottom essentially correspond to instances of the apple tasting problem (see Example 5). The algorithm's
main entry point is shown in Figure 3. Its inputs are the game $G = (L, H)$, the time horizon $T$, and a
confidence parameter $0 < \delta < 1$. The algorithm first eliminates the dominated and duplicate actions. This
is followed by building a tree, which is used to store variables necessary to play in the subgames (Figure 5):
If the number of actions is 2, the procedure initializes various parameters that are used either by a bandit
algorithm (based on Exp3.P [5]), or by the exponentially weighted average algorithm (EWA) [4]. In the
other case, it calls itself recursively on the split subgames and with an appropriately decreased confidence
parameter.

function Main($G$, $T$, $\delta$)
Input: $G = (L, H)$ is a game, $T$ is a horizon,
       $0 < \delta < 1$ is a confidence parameter
 1: $G \leftarrow$ Purify($G$)
 2: BuildTree(root, $G$, $\delta$)
 3: for $t \leftarrow 1$ to $T$ do
 4:     Play(root)
 5: end for

Figure 3: The main entry point of the AppleTree algorithm.

function InitEta($G$, $T$)
Input: $G$ is a game, $T$ is a horizon
    if IsRevealing($G$, 2) then
        $\eta(v) \leftarrow \sqrt{8 \ln 2 / T}$
    else
        $\eta(v) \leftarrow \gamma(v)/4$
    end if

Figure 4: The initialization routine InitEta.

function BuildTree($v$, $G$, $\delta$)
Input: $G = (L, H)$ is a game, $v$ is a tree node
 1: if NumOfActions($G$) = 2 then
 2:     if not IsRevealing($G$, 1) then
 3:         $G \leftarrow$ SwapActions($G$)
 4:     end if
 5:     $w_i(v) \leftarrow 1/2$, $i = 1, 2$
 6:     $\beta(v) \leftarrow \sqrt{\ln(2/\delta)/(2T)}$
 7:     $\gamma(v) \leftarrow 8\beta(v)/(3 + \beta(v))$
 8:     InitEta($G$, $T$)
 9: else
10:     $(G_1, G_2) \leftarrow$ SplitGame($G$)
11:     BuildTree(Child($v$, 1), $G_1$, $\delta/(4T)$)
12:     BuildTree(Child($v$, 2), $G_2$, $\delta/(4T)$)
13:     $g(v) \leftarrow 1$, $\hat{\rho}(v) \leftarrow 0$, $t(v) \leftarrow 1$
14:     $(\rho_1'(v), \rho_2'(v)) \leftarrow$ Boundaries($G$)
15: end if
16: $G(v) \leftarrow G$

Figure 5: The tree building procedure.

The main worker routine is called Play. This is again a recursive function (see Figure 6). The special
case when the number of actions is two is handled in routine PlayAtLeaf, which will be discussed later.
When the number of actions is larger, the algorithm recurses to play in the subgame that was remembered
as the preferred game from the last round and then updates its estimate of the frequency of outcome
2 based on the information received. When this estimate changes so that a switch of the current preferred
game is necessary, the algorithm resets the algorithms in the subtree corresponding to the game switched to,
and changes the variable storing the index of the preferred game. The Reset function used for this purpose,
shown in Figure 7, is also recursive.
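The switching logic at an internal node can be illustrated with a small Python sketch. It is not the authors' code: the class `NodeState` and function `update` are hypothetical names, the two boundaries are passed in directly, and the observation term is assumed to be the importance-weighted indicator `1/p` when the feedback `h` equals 2 (and 0 otherwise). Resetting the subtree is only counted here, not performed.

```python
class NodeState:
    """State kept at an internal tree node (hypothetical container)."""

    def __init__(self, rho1, rho2):
        self.rho1 = rho1      # lower switching boundary
        self.rho2 = rho2      # upper switching boundary, rho1 < rho2
        self.g = 1            # index of the preferred subgame
        self.rho_hat = 0.0    # running estimate of the frequency of outcome 2
        self.t = 1            # local round counter
        self.resets = 0       # number of switches so far


def update(state, p, h):
    """One bookkeeping step after the child returned (p, h).

    p is the probability with which the revealing action was played,
    h is the feedback (assumed: h == 2 only if outcome 2 was observed).
    """
    obs = 1.0 / p if h == 2 else 0.0
    state.rho_hat = (1.0 - 1.0 / state.t) * state.rho_hat + obs / state.t
    if state.g == 2 and state.rho_hat < state.rho1:
        state.g = 1           # switch back to the first subgame
        state.resets += 1
    elif state.g == 1 and state.rho_hat > state.rho2:
        state.g = 2           # switch to the second subgame
        state.resets += 1
    state.t += 1
    return state.g
```

Because a switch happens only when the estimate leaves the interval between the two boundaries on the far side, the estimate has to travel the whole gap before the next switch, which is the mechanism behind the small number of resets.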

At the leaves, when there are only two actions, either EWA or Exp3.P is used. These algorithms are
used with their standard optimized parameters (see Corollary 4.2 for the tuning of EWA, and Theorem 6.10
for the tuning of Exp3.P, both from the book of Cesa-Bianchi and Lugosi [20]). For completeness, their
pseudocodes are shown in Figures 8-9. Note that with Exp3.P (lines 6-14) we use the loss matrix
transformation described earlier, hence the loss matrix has zero entries for the second (non-revealing) action,
while the entry for action 1 and outcome j is $\ell_{1,j}(v) - \ell_{2,j}(v)$. Here $\ell_{i,j}(v)$ stands for the loss of action i and
outcome j in the game G(v) that is stored at node v.
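As an illustration of the full-information leaf, the following Python sketch performs one round of the exponentially weighted average forecaster on two actions. The function name `ewa_step` and the callback `choose` (which plays an action and returns the observed outcome) are assumptions made for this example only.

```python
import math
import random

def ewa_step(w, eta, losses, choose, rng=random):
    """One EWA round at a two-action, full-information leaf (sketch).

    w       -- list [w1, w2] of current weights, updated in place
    eta     -- learning rate
    losses  -- losses[i][j] is the loss of action i+1 under outcome j+1
    choose  -- hypothetical callback: plays the action, returns outcome in {1, 2}
    """
    p = w[0] / (w[0] + w[1])                 # probability of playing action 1
    action = 1 if rng.random() < p else 2
    h = choose(action)                       # both actions reveal the outcome
    w[0] *= math.exp(-eta * losses[0][h - 1])
    w[1] *= math.exp(-eta * losses[1][h - 1])
    return p, h
```

With losses `[[0, 1], [1, 0]]` and an environment that always answers with outcome 2, the weight of action 1 shrinks by a factor of `exp(-eta)` per round while the weight of action 2 is unchanged.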
function Play(v)
Input: v is a tree node
    if NumOfActions(G(v)) = 2 then
        $(p, h) \leftarrow$ PlayAtLeaf(v)
    else
        $(p, h) \leftarrow$ Play(Child(v, g(v)))
        $\hat\rho(v) \leftarrow \left(1 - \frac{1}{t(v)}\right)\hat\rho(v) + \frac{1}{t(v)} \cdot \frac{\mathbb{I}(h = 2)}{p}$
        if $g(v) = 2$ and $\hat\rho(v) < \rho'_1(v)$ then
            Reset(Child(v, 1)); $g(v) \leftarrow 1$
        else if $g(v) = 1$ and $\hat\rho(v) > \rho'_2(v)$ then
            Reset(Child(v, 2)); $g(v) \leftarrow 2$
        end if
        $t(v) \leftarrow t(v) + 1$
    end if
    return (p, h)

Figure 6: The recursive function Play

function Reset(v)
Input: v is a tree node
1: if NumOfActions(G(v)) = 2 then
2:     $w_i(v) \leftarrow 1/2$, $i = 1, 2$
3: else
5:     Reset(Child(v, 1))
6: end if

Figure 7: Function Reset



5.2. Proof of the upper bound 

Theorem 10. Assume G = (L, H) satisfies the separation condition and the non-degeneracy condition and
$\ell_{i,j} \le 1$. Denote by $\widehat{R}_T$ the regret of Algorithm AppleTree up to time step T. There exist constants c, p such
that for any $0 < \delta < 1$ and $T \in \mathbb{N}$, for any outcome sequence $J_1, \ldots, J_T$, the algorithm with input G, T, $\delta$
achieves $\Pr\left[\widehat{R}_T \le c\sqrt{T}\,\ln^p(2T/\delta)\right] \ge 1 - \delta$.

Throughout the proof we will analyze the algorithm's behavior at the root node. We will use time indices
as follows. Let us define the filtration $\{\mathcal{F}_t = \sigma(I_1, \ldots, I_t)\}_t$, where $I_t$ is the action the algorithm plays at time
step t. To any variable x(v) used by the algorithm, we denote by $x_t(v)$ the value of x(v) that is measurable
with respect to $\mathcal{F}_t$, but not measurable with respect to $\mathcal{F}_{t-1}$. From now on we abbreviate $x_t(\mathrm{root})$ by $x_t$. We
start with two lemmas. The first lemma shows that the number of switches the algorithm makes is small.

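Lemma 11. Let S be the number of times AppleTree calls Reset at the root node. Then there exists a
universal constant $c^*$ such that $S \le \frac{c^* \ln T}{\Delta}$, where $\Delta = \rho'_2 - \rho'_1$ with $\rho'_1$ and $\rho'_2$ the boundaries stored at the root (line 14 of BuildTree).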

Note that here we use the non-degeneracy condition to ensure that A > 0. 

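Proof. Let s be the number of times the algorithm switches from $G_2$ to $G_1$. Let $t_1 < \cdots < t_s$ be the time
steps when $\hat\rho_t$ becomes smaller than $\rho'_1$. Similarly, let $t'_1 < \cdots < t'_{s+\xi}$ ($\xi \in \{0, 1\}$) be the time steps when $\hat\rho_t$
becomes greater than $\rho'_2$. Note that for all $1 \le j \le s$, $t'_j < t_j < t'_{j+1}$. Finally, for every $1 \le j \le s$, we define
$t''_j = \min\{t \mid t'_j \le t \le t_j,\ (\forall t \le \tau \le t_j : \hat\rho_\tau \le 1)\}$. In other words, $t''_j$ is the time step when $\hat\rho_t$ drops below 1
and stays there until the next reset.

First we observe that if $t''_j \ge 2/\Delta$ then $\hat\rho_{t''_j} \ge (\rho'_1 + \rho'_2)/2$. Indeed, if $t''_j = t'_j$ then $\hat\rho_{t''_j} \ge \rho'_2$; on the other
hand, if $t''_j \ne t'_j$ then $\hat\rho_{t''_j - 1} > 1$ and, from the update rule we have

$$\hat\rho_{t''_j} \ge \left(1 - \frac{1}{t''_j}\right)\hat\rho_{t''_j - 1} \ge 1 - \frac{1}{t''_j} \ge 1 - \frac{\Delta}{2} \ge \frac{\rho'_1 + \rho'_2}{2} .$$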
function PlayAtLeaf(v)
Input: v is a tree node
    if RevealingActionNumber(G(v)) = 2 then      ▷ Full-information case
        $(p, h) \leftarrow$ Ewa(v)
    else                                         ▷ Partial-information case
        $p \leftarrow (1 - \gamma(v))\,\frac{w_1(v)}{w_1(v) + w_2(v)} + \gamma(v)/2$
        $U \sim \mathcal{U}_{[0,1)}$                            ▷ U is uniform in [0, 1)
        if $U < p$ then                          ▷ Play revealing action
            $h \leftarrow$ Choose(1)                    ▷ $h \in \{1, 2\}$
            $L_2 \leftarrow \beta(v)/(1 - p)$
            $w_1(v) \leftarrow w_1(v)\exp(-\eta(v) L_1)$
            $w_2(v) \leftarrow w_2(v)\exp(-\eta(v) L_2)$
        else
            $h \leftarrow$ Choose(2)                    ▷ here $h = *$
        end if
    end if
    return (p, h)

Figure 8: Function PlayAtLeaf

function Ewa(v)
Input: v is a tree node
    $p \leftarrow \frac{w_1(v)}{w_1(v) + w_2(v)}$
    $U \sim \mathcal{U}_{[0,1)}$                                ▷ U is uniform in [0, 1)
    if $U < p$ then
        $I \leftarrow 1$
    else
        $I \leftarrow 2$
    end if
    $h \leftarrow$ Choose(I)                            ▷ $h \in \{1, 2\}$
    $w_1(v) \leftarrow w_1(v)\exp(-\eta(v)\,\ell_{1,h}(v))$
    $w_2(v) \leftarrow w_2(v)\exp(-\eta(v)\,\ell_{2,h}(v))$
    return (p, h)

Figure 9: Function Ewa



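The number of times the algorithm resets is at most $2s + 1$. Let $j^*$ be the first index such that $t''_{j^*} \ge 2/\Delta$.
For any $j^* \le j \le s$, $\hat\rho_{t''_j} \ge (\rho'_1 + \rho'_2)/2$ and $\hat\rho_{t_j} \le \rho'_1$. According to the update rule we have for any $t''_j < t \le t_j$
that

$$\hat\rho_t \ge \left(1 - \frac{1}{t}\right)\hat\rho_{t-1} = \hat\rho_{t-1} - \frac{1}{t}\,\hat\rho_{t-1} \ge \hat\rho_{t-1} - \frac{1}{t}$$

and hence $\hat\rho_{t-1} - \hat\rho_t \le \frac{1}{t}$. Summing this inequality for all $t''_j + 1 \le t \le t_j$ such that $j \ge j^*$ we get

$$\frac{\Delta}{2} = \frac{\rho'_1 + \rho'_2}{2} - \rho'_1 \le \hat\rho_{t''_j} - \hat\rho_{t_j} \le \sum_{t = t''_j + 1}^{t_j} \frac{1}{t} .$$

Thus, there exists $c > 0$ such that for all $j^* \le j \le s$

$$\frac{1}{c}\,\Delta \le \ln\frac{t_j}{t''_j} \le \ln\frac{t_j}{t_{j-1}} . \qquad (3)$$

Adding (3) for $j^* < j \le s$ we get $(s - j^*)\frac{1}{c}\Delta \le \ln\frac{t_s}{t_{j^*}} \le \ln T$. We conclude the proof with observing that
$j^* \le 2/\Delta$. □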



The next lemma shows that the estimate of the relative frequency of outcome 2 is not far away from its
true value.

Lemma 12. For any $0 < \delta < 1$, with probability at least $1 - \delta$, for all $t \ge 8\sqrt{T}\ln(2T/\delta)/(3\Delta^2)$, $|\hat\rho_t - \rho_t| \le \Delta$.

The proof of the lemma employs Bernstein's inequality for martingales.

Bernstein's inequality for martingales. [20, Lemma A.8] Let $X_1, X_2, \ldots, X_n$ be a bounded martingale
difference sequence with respect to a filtration $(\mathcal{F}_i)_{i=0,\ldots,n}$ and with $|X_i| \le K$. Let

$$S_i = \sum_{j=1}^{i} X_j$$

be the associated martingale. Denote the sum of conditional variances by

$$\Sigma_n^2 = \sum_{i=1}^{n} \mathbb{E}\left[X_i^2 \mid \mathcal{F}_{i-1}\right] .$$

Then, for all constants $\epsilon, v > 0$,

$$\Pr\left[\max_{i \in \{1,\ldots,n\}} S_i > \epsilon \text{ and } \Sigma_n^2 \le v\right] \le \exp\left(-\frac{\epsilon^2}{2(v + K\epsilon/3)}\right) .$$



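Proof of Lemma 12. For $1 \le t \le T$, let $p_t$ be the conditional probability of playing a revealing action at
time step t, given the history $\mathcal{H}_{1:t-1}$. Recall that, due to the construction of the algorithm, $p_t \ge 1/\sqrt{T}$.

If we write $\hat\rho_t$ in its explicit form $\hat\rho_t = \frac{1}{t}\sum_{s=1}^{t} \frac{\mathbb{I}(I_s \text{ is revealing},\, J_s = 2)}{p_s}$ we can observe that $\mathbb{E}[\hat\rho_t \mid \mathcal{H}_{1:t-1}] = \rho_t$, that is, $\hat\rho_t$
is an unbiased estimate of the relative frequency. Let us define random variables $X_s := \frac{\mathbb{I}(I_s \text{ is revealing},\, J_s = 2)}{p_s} - \mathbb{I}(J_s = 2)$.
Since $p_s$ is determined by the history, $(X_s)_s$ is a martingale difference sequence. Also, from $p_s \ge 1/\sqrt{T}$ we
know that $\mathrm{Var}(X_s \mid \mathcal{H}_{1:s-1}) \le \sqrt{T}$. Hence, we can use Bernstein's inequality for martingales with $\epsilon = \Delta t$,
$v = t\sqrt{T}$, $K = \sqrt{T}$:

$$\Pr\left[|\hat\rho_t - \rho_t| > \Delta\right] = \Pr\left[\left|\sum_{s=1}^{t} X_s\right| > t\Delta\right] \le 2\exp\left(-\frac{\Delta^2 t^2/2}{t\sqrt{T} + \Delta t\sqrt{T}/3}\right) \le 2\exp\left(-\frac{3\Delta^2 t}{8\sqrt{T}}\right) .$$

We have that if $t \ge 8\sqrt{T}\ln(2T/\delta)/(3\Delta^2)$ then

$$\Pr\left[|\hat\rho_t - \rho_t| > \Delta\right] \le \delta/T .$$

We get the bound for all $t \in [8\sqrt{T}\ln(2T/\delta)/(3\Delta^2),\, T]$ using the union bound. □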
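The arithmetic behind the threshold in Lemma 12 can be checked numerically. The sketch below is an illustration only; the function names are hypothetical. It evaluates the two-sided Bernstein tail with $\epsilon = \Delta t$, $v = t\sqrt{T}$, $K = \sqrt{T}$ and compares it with $\delta/T$ at the threshold $t = 8\sqrt{T}\ln(2T/\delta)/(3\Delta^2)$, assuming $\Delta \le 1$.

```python
import math

def bernstein_tail(t, T, Delta):
    """Two-sided Bernstein bound with eps = Delta*t, v = t*sqrt(T), K = sqrt(T)."""
    eps = Delta * t
    v = t * math.sqrt(T)
    K = math.sqrt(T)
    return 2.0 * math.exp(-eps ** 2 / (2.0 * (v + K * eps / 3.0)))

def threshold(T, delta, Delta):
    """Smallest t covered by Lemma 12: 8*sqrt(T)*ln(2T/delta) / (3*Delta^2)."""
    return 8.0 * math.sqrt(T) * math.log(2.0 * T / delta) / (3.0 * Delta ** 2)

def per_step_failure_ok(T, delta, Delta):
    """True if the tail at (just above) the threshold is at most delta/T."""
    t = threshold(T, delta, Delta) + 1.0
    return bernstein_tail(t, T, Delta) <= delta / T
```

The comparison uses $1 + \Delta/3 \le 4/3$, which is where the factor $3/8$ in the exponent comes from; a union bound over at most T time steps then gives total failure probability at most $\delta$.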



Proof of Theorem 10. To prove that the algorithm achieves the desired regret bound we use induction on the
depth of the tree, d. If d = 1, AppleTree plays either EWA or Exp3.P. EWA is known to satisfy Theorem 10
and, as we discussed earlier, Exp3.P achieves $O(\sqrt{T}\,\ln(T/\delta))$ regret as well. As the induction hypothesis we
assume that Theorem 10 is true for any T and any game such that the tree built by the algorithm has depth
$d' < d$.

Let $Q_1 = \{1, \ldots, \lceil K/2 \rceil\}$, $Q_2 = \{\lceil K/2 \rceil, \ldots, K\}$ be the sets of actions associated with the subgames in
the root. (Recall that the actions are ordered with respect to j.) Furthermore, let us define the following
values: Let $T^*_0 = 1$, let $T^*_i$ be the first time step t after $T^*_{i-1}$ such that $g_t \ne g_{t-1}$. In other words, $T^*_i$ are
the time steps when the algorithm switches between the subgames. Finally, let $T_i = \min(T^*_i, T + 1)$. From
Lemma 11 we know that $T_{S_{\max}+1} = T + 1$, where $S_{\max} = \frac{c^* \ln T}{\Delta}$. It is easy to see that $T_i$ are stopping times
for any $i \ge 1$.

Without loss of generality, from now on we will assume that the optimal action $i^* \in Q_1$. If $i^* = \lceil K/2 \rceil$
then, since it is contained in both subgames, the bound trivially follows from the induction hypothesis and
Lemma 11. In the rest of the proof we assume $i^* < K/2$.

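Let $S = \max\{i \ge 1 \mid T^*_i \le T\}$ be the number of switches, $c = \frac{8}{3\Delta^2}$, and $\mathcal{B}$ be the event that for all
$t \ge c\sqrt{T}\ln(4T/\delta)$, $|\hat\rho_t - \rho_t| \le \Delta$. We know from Lemma 12 that $\Pr[\mathcal{B}] \ge 1 - \delta/2$. On $\mathcal{B}$ we have that
$|\hat\rho_T - \rho_T| \le \Delta$, and thus, using that $i^* < K/2$, $\rho_T \le \rho^*_1$. This implies that in the last phase the algorithm plays
on $G_1$. It is also easy to see that before the last switch, at time step $T_S - 1$, $\hat\rho$ is between $\rho^*_1$ and $\rho^*_2$, if $T_S$ is
large enough. Thus, up to time step $T_S - 1$, the optimal action is $\lceil K/2 \rceil$, the one that is shared by the two
subgames. This implies that $\sum_{t=1}^{T_S - 1} \left(\ell_{i^*,J_t} - \ell_{\lceil K/2 \rceil,J_t}\right) \ge 0$. On the other hand, if $T_S \le c\sqrt{T}\ln(4T/\delta)$ then

$$\sum_{t=1}^{T_S - 1} \left(\ell_{i^*,J_t} - \ell_{\lceil K/2 \rceil,J_t}\right) \ge -c\sqrt{T}\ln(4T/\delta) .$$

Thus, we have

$$\begin{aligned}
\widehat{R}_T &= \sum_{t=1}^{T} \left(\ell_{I_t,J_t} - \ell_{i^*,J_t}\right)
 = \sum_{t=1}^{T_S - 1} \left(\ell_{I_t,J_t} - \ell_{i^*,J_t}\right) + \sum_{t=T_S}^{T} \left(\ell_{I_t,J_t} - \ell_{i^*,J_t}\right) \\
&\le \mathbb{I}(\mathcal{B}) \left( \sum_{t=1}^{T_S - 1} \left(\ell_{I_t,J_t} - \ell_{\lceil K/2 \rceil,J_t}\right) + \sum_{t=T_S}^{T} \left(\ell_{I_t,J_t} - \ell_{i^*,J_t}\right) \right) + \underbrace{c\sqrt{T}\ln(4T/\delta) + \mathbb{I}(\mathcal{B}^c)\,T}_{D} \\
&\le D + \mathbb{I}(\mathcal{B}) \sum_{r=1}^{S_{\max}} \max_{i \in Q_{\pi(r)}} \sum_{t=T_{r-1}}^{T_r - 1} \left(\ell_{I_t,J_t} - \ell_{i,J_t}\right) \\
&= D + \mathbb{I}(\mathcal{B}) \sum_{r=1}^{S_{\max}} \max_{i \in Q_{\pi(r)}} \sum_{m=1}^{T} \mathbb{I}(T_r - T_{r-1} = m) \sum_{t=T_{r-1}}^{T_{r-1}+m-1} \left(\ell_{I_t,J_t} - \ell_{i,J_t}\right) ,
\end{aligned}$$

where $\pi(r)$ is 1 if r is odd and 2 if r is even. Note that for the last line of the above inequality chain to be
well defined, we need outcome sequences of length at most 2T. It does us no harm to assume that for all
$T < t \le 2T$, say, $J_t = 1$.

Recall that the strategies that play in the subgames are reset after the switches. Hence, the sum
$\widehat{R}^{(r)}_m = \max_{i \in Q_{\pi(r)}} \sum_{t=T_{r-1}}^{T_{r-1}+m-1} \left(\ell_{I_t,J_t} - \ell_{i,J_t}\right)$ is the regret of the algorithm if it is used in the subgame $G_{\pi(r)}$ for $m \le T$ steps. Then,
exploiting that $T_r$ are stopping times, we can use the induction hypothesis to bound $\widehat{R}^{(r)}_m$. In particular, let $\mathcal{C}$
be the event that for all $m \le T$ the sum is less than $c\sqrt{T}\ln^p(2T^2/\delta)$. Since the root node calls its children
with confidence parameter $\delta/(2T)$, we have that $\Pr[\mathcal{C}^c] \le \delta/2$. In summary,

$$\begin{aligned}
\widehat{R}_T &\le D + \mathbb{I}(\mathcal{C}^c)\,T + \mathbb{I}(\mathcal{B})\,\mathbb{I}(\mathcal{C})\,S_{\max}\,c\sqrt{T}\ln^p(2T^2/\delta) \\
&\le \mathbb{I}(\mathcal{B}^c \cup \mathcal{C}^c)\,T + c\sqrt{T}\ln(4T/\delta) + \mathbb{I}(\mathcal{B})\,\mathbb{I}(\mathcal{C})\,\frac{c^* \ln T}{\Delta}\,c\sqrt{T}\ln^p(2T^2/\delta) .
\end{aligned}$$

Thus, on $\mathcal{B} \cap \mathcal{C}$, $\widehat{R}_T \le c'\sqrt{T}\ln^{p+1}(2T/\delta)$ for a suitable constant $c'$, which, together with $\Pr[\mathcal{B}^c \cup \mathcal{C}^c] \le \delta$, concludes the proof. □

Remark. The above theorem proves a high probability bound on the regret. We can get a bound on the
expected regret if we set $\delta$ to $1/\sqrt{T}$. Also note that the bound given by the induction grows in the number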
of non-dominated actions as 0{K^°^^ 

6. Lower Bound for Non-Trivial Games 

In the following sections, $\|\cdot\|_1$ and $\|\cdot\|$ denote the $L_1$- and $L_2$-norm of a vector in a Euclidean space,
respectively.

In this section, we show that non-trivial games have minimax regret at least $\Omega(\sqrt{T})$. We state and prove
this result for all finite games, in contrast to earlier related lower bounds which apply to specific losses (see
Cesa-Bianchi and Lugosi [20, Theorems 3.7, 6.3, 6.4, 6.11] for full-information, label efficient, and bandit
games).

Theorem 13 (Lower bound for non-trivial games). If G = (L, H) is a finite non-trivial ($K \ge 2$)
partial-monitoring game then there exists a constant $c > 0$ such that for any $T \ge 1$ the minimax expected regret
$R_T(G) \ge c\sqrt{T}$.

The proof presented below works for stochastic nature, as well. There is a far simpler proof in the
Appendix; that one, however, applies only to adversarial nature.

Recall that $\Delta_M \subset \mathbb{R}^M$ is the $(M - 1)$-dimensional probability simplex.

For the proof, we start with a geometrical lemma, which ensures the existence of a pair $i_1, i_2$ of
non-dominated actions that are "neighbors" in the sense that for any small enough $\epsilon > 0$, there exists a pair of
"$\epsilon$-close" outcome distributions $p + \epsilon v$ and $p - \epsilon v$ such that $i_1$ is uniquely optimal under the first distribution,
and $i_2$ is uniquely optimal under the second distribution, overtaking each non-optimal action by at least $\Omega(\epsilon)$
in both cases.

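Lemma 14 ($\epsilon$-close distributions). Let G = (L, H) be any finite non-trivial game with N non-duplicate
actions and $M \ge 2$ outcomes. Then there exist two non-dominated actions $i_1, i_2 \in \underline{N}$, $p \in \Delta_M$, $v \in \mathbb{R}^M \setminus \{0\}$,
and $c, \alpha > 0$ satisfying the following properties:

(a) $\ell_{i_1} \ne \ell_{i_2}$.

(b) $\langle \ell_{i_1}, p \rangle = \langle \ell_{i_2}, p \rangle \le \langle \ell_i, p \rangle$ for all $i \in \underline{N}$ and the coordinates of p are positive.

(c) Coordinates of v satisfy $\sum_{j=1}^{M} v(j) = 0$.

For any $\epsilon \in (0, \alpha)$,

(d) $p_1 = p + \epsilon v \in \Delta_M$ and $p_2 = p - \epsilon v \in \Delta_M$,

(e) for any $i \in \underline{N}$, $\ell_i \ne \ell_{i_1}$, we have $\langle \ell_i - \ell_{i_1}, p_1 \rangle \ge c\epsilon$,

(f) for any $i \in \underline{N}$, $\ell_i \ne \ell_{i_2}$, we have $\langle \ell_i - \ell_{i_2}, p_2 \rangle \ge c\epsilon$.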



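Proof of Lemma 14. For any action $i \in \underline{N}$, consider the cell

$$C_i = \{p \in \Delta_M : \forall i' \in \underline{N},\ \langle \ell_i, p \rangle \le \langle \ell_{i'}, p \rangle\}$$

in the probability simplex $\Delta_M$. The cell $C_i$ corresponds to the set of outcome distributions under which
action i is optimal. Each cell is the intersection of some closed half-spaces and $\Delta_M$, and thus it is a compact
convex polytope of dimension at most $M - 1$. Note that

$$\bigcup_{i=1}^{N} C_i = \Delta_M . \qquad (4)$$

For $C \subseteq \Delta_M$, denote int C its interior in the topology induced by the hyperplane $\{x \in \mathbb{R}^M : \langle (1, \ldots, 1), x \rangle = 1\}$
and rint C its relative interior. Let $\lambda$ be the $(M - 1)$-dimensional Lebesgue-measure. It is easy to see that
for any pair of cells $C_i$, $C_{i'}$, $C_{i'} \cap \mathrm{int}\, C_i = \emptyset$, that is, $\lambda(C_i \cap C_{i'}) = 0$, and so

$$\mathrm{int}\, C_i \subseteq C_i \setminus \bigcup_{i' : \ell_{i'} \ne \ell_i} C_{i'} . \qquad (5)$$

Hence the cells form a cell-decomposition of the simplex. Any two cells $C_i$ and $C_{i'}$ are separated by the
hyperplane $f_{i,i'} = \{x \in \mathbb{R}^M : \langle \ell_i, x \rangle = \langle \ell_{i'}, x \rangle\}$. Note that $C_i \cap C_{i'} \subseteq f_{i,i'}$. The cells are characterized by the
following lemma (which itself holds also with duplicate actions):

Lemma 15. Action i is dominated $\Leftrightarrow$ $C_i \subseteq \bigcup_{i' : \ell_{i'} \ne \ell_i} C_{i'}$ $\Leftrightarrow$ $\mathrm{int}\, C_i = \emptyset$ $\Leftrightarrow$ $\lambda(C_i) = 0$, that is, $C_i$ is $(M - 1)$-dimensional
(has positive $\lambda$-measure) if and only if there is $p \in C_i \setminus \bigcup_{i' : \ell_{i'} \ne \ell_i} C_{i'}$. Hence there are three kinds
of "cells":

1. $C_i = \emptyset$ (action i is never optimal),

2. $C_i \ne \emptyset$ has dimension less than $M - 1$, $\mathrm{int}\, C_i = \emptyset$, $\lambda(C_i) = 0$, $C_i \subseteq \bigcup_{i' : \ell_{i'} \ne \ell_i} C_{i'}$ (action i is degenerate),

3. action i is non-dominated, $C_i$ is $(M - 1)$-dimensional, $\mathrm{rint}\, C_i = \mathrm{int}\, C_i \ne \emptyset$, $\lambda(C_i) > 0$, there is
$p \in C_i \setminus \bigcup_{i' : \ell_{i'} \ne \ell_i} C_{i'}$.

Moreover $\bigcup_{i \notin \mathcal{D}} C_i = \Delta_M$ for the set $\mathcal{D}$ of dominated actions.
The proof is in the Appendix.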

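The non-triviality of the game ($K \ge 2$) means that there are at least two non-dominated actions of type
3 above. In the cell decomposition, due to Lemma 15 there must exist two such $(M - 1)$-dimensional cells
$C_{i_1}$ and $C_{i_2}$ corresponding to two non-dominated actions $i_1, i_2$, such that their intersection $C_{i_1} \cap C_{i_2}$ is an
$(M - 2)$-dimensional polytope. Clearly, $\ell_{i_1} \ne \ell_{i_2}$, since otherwise the cells would coincide; thus part (a) is
satisfied.

Moreover, $\mathrm{rint}(C_{i_1} \cap C_{i_2}) \subseteq \mathrm{rint}\, \Delta_M$ since otherwise $\lambda(C_{i_1})$ or $\lambda(C_{i_2})$ would be zero. We can choose any
$p \in \mathrm{rint}(C_{i_1} \cap C_{i_2})$. This choice of p guarantees that $p \in f_{i_1,i_2}$, $\langle \ell_{i_1}, p \rangle = \langle \ell_{i_2}, p \rangle$, $p \in \mathrm{rint}\, \Delta_M$, and part
(b) is satisfied. Since $C_{i_1} \cap C_{i_2}$ is $(M - 2)$-dimensional, it also implies that there exists $\delta > 0$ such that the
$\delta$-neighborhood $\{q \in \Delta_M : \|p - q\| < \delta\}$ of p is contained in $\mathrm{rint}(C_{i_1} \cup C_{i_2})$.

Since $p \in f_{i_1,i_2}$, the hyperplane of vectors satisfying (c) does not coincide with $f_{i_1,i_2}$, implying
that we can choose $v \in \mathbb{R}^M \setminus \{0\}$ satisfying part (c), $\|v\| < \delta$, and $v \notin f_{i_1,i_2}$. We can assume

$$\langle \ell_{i_2} - \ell_{i_1}, v \rangle > 0 \qquad (6)$$

(otherwise we choose $-v$). Since $p \pm v$ lie in the $\delta$-neighborhood of p, they lie in $\mathrm{rint}(C_{i_1} \cup C_{i_2})$. In particular,
since $\langle \ell_{i_1}, p + v \rangle < \langle \ell_{i_2}, p + v \rangle$ and $\langle \ell_{i_2}, p - v \rangle < \langle \ell_{i_1}, p - v \rangle$, $p + v \in \mathrm{rint}\, C_{i_1}$ and $p - v \in \mathrm{rint}\, C_{i_2}$. Let

$$p_1 = p + \epsilon v \quad \text{and} \quad p_2 = p - \epsilon v . \qquad (7)$$

The convexity of $C_{i_1}$ and $C_{i_2}$ implies that for any $\epsilon \in (0, 1]$, $p_1 \in \mathrm{rint}\, C_{i_1}$ and $p_2 \in \mathrm{rint}\, C_{i_2}$. This, in particular,
ensures that $p_1, p_2 \in \Delta_M$ and part (d) holds.

Footnote: Relative interior of $C \subseteq \mathbb{R}^M$ is its interior in the topology induced by the smallest affine space containing it.

To prove (e) define $\mathcal{I} = \{i \in \underline{N} : \ell_i \text{ is collinear with } \ell_{i_1} \text{ and } \ell_{i_2}\}$. We consider two cases: As the first case
fix action $i \in \mathcal{I} \setminus \{i_1\}$, that is, $\ell_i$ is an affine combination $\ell_i = a_i \ell_{i_1} + b_i \ell_{i_2}$ for some $a_i + b_i = 1$. Since $i_1$ and
$i_2$ are non-dominated, this must be a convex combination with $a_i, b_i \ge 0$. There is no duplicate action, thus
$\ell_i \ne \ell_{i_1}$, implying $b_i \ne 0$. Hence $b_i > 0$, and from (7) for any $\epsilon > 0$

$$\langle \ell_i - \ell_{i_1}, p_1 \rangle = \langle b_i \ell_{i_2} - b_i \ell_{i_1}, p + \epsilon v \rangle = \epsilon\, b_i \langle \ell_{i_2} - \ell_{i_1}, v \rangle \ge c\epsilon$$

provided that $0 < c \le \min_{i \in \mathcal{I} \setminus \{i_1\}} b_i \langle \ell_{i_2} - \ell_{i_1}, v \rangle =: c'$. From (6) we know that $b_i \langle \ell_{i_2} - \ell_{i_1}, v \rangle$ and so $c'$ are
positive.

As the second case suppose $i \notin \mathcal{I}$. Then, the hyperplane $f_{i_1,i}$ does not coincide with $f_{i_1,i_2}$. Since
$p \in \mathrm{rint}(C_{i_1} \cap C_{i_2})$, $p \in f_{i_1,i}$ would contradict $f_{i_1,i} \cap \mathrm{rint}\, C_{i_1} = \emptyset$ implied by (5). Thus $p \in C_{i_1} \setminus f_{i_1,i}$ and
therefore $\langle \ell_{i_1}, p \rangle < \langle \ell_i, p \rangle$. This means that if we choose $0 < c \le \min\left(c', \frac{1}{2}\min_{i \notin \mathcal{I}} \langle \ell_i - \ell_{i_1}, p \rangle\right)$ (that is
positive and depends only on L and not on T) then for $\epsilon < \alpha = \min\left(1, c / \max_{i \notin \mathcal{I}} |\langle \ell_i - \ell_{i_1}, v \rangle|\right)$, from (7) we
have again

$$\langle \ell_i - \ell_{i_1}, p_1 \rangle \ge 2c + \epsilon \langle \ell_i - \ell_{i_1}, v \rangle \ge c \ge c\epsilon .$$

Part (f) is proved analogously to part (e), and by adjusting $\alpha$ and c if necessary. □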

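We now continue with a technical lemma, which quantifies an upper bound on the Kullback-Leibler
(KL) divergence (or relative entropy) between the two distributions from the previous lemma. Recall that
the KL divergence between two probability distributions $p, q \in \Delta_M$ is defined as

$$D(p \,\|\, q) = \sum_{j=1}^{M} p_j \ln\frac{p_j}{q_j} .$$

Lemma 16 (KL divergence of $\epsilon$-close distributions). Let $p \in \Delta_M$ be a probability vector. For any vector
$\boldsymbol{\varepsilon} \in \mathbb{R}^M$ such that both $p - \boldsymbol{\varepsilon}$ and $p + \boldsymbol{\varepsilon}$ lie in $\Delta_M$ and $|\boldsymbol{\varepsilon}(j)| \le p(j)/2$ for all $j \in \underline{M}$, the KL divergence of
$p - \boldsymbol{\varepsilon}$ and $p + \boldsymbol{\varepsilon}$ satisfies

$$D(p - \boldsymbol{\varepsilon} \,\|\, p + \boldsymbol{\varepsilon}) \le c\,\|\boldsymbol{\varepsilon}\|^2$$

for some constant $c > 0$ depending only on p.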



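Proof of Lemma 16. Since $p$, $p + \boldsymbol{\varepsilon}$, and $p - \boldsymbol{\varepsilon}$ are all probability vectors, notice that the coordinates of $\boldsymbol{\varepsilon}$ have
to sum up to zero. Also if a coordinate of p is zero then the corresponding coordinate of $\boldsymbol{\varepsilon}$ has to be zero as
well. As zero coordinates do not modify the KL divergence, we can assume without loss of generality that
all coordinates of p are positive. By definition,

$$D(p - \boldsymbol{\varepsilon} \,\|\, p + \boldsymbol{\varepsilon}) = \sum_{j=1}^{M} (p(j) - \boldsymbol{\varepsilon}(j)) \ln\frac{p(j) - \boldsymbol{\varepsilon}(j)}{p(j) + \boldsymbol{\varepsilon}(j)} .$$

We write the logarithmic factor as

$$\ln\frac{p(j) - \boldsymbol{\varepsilon}(j)}{p(j) + \boldsymbol{\varepsilon}(j)} = \ln\left(1 - \frac{\boldsymbol{\varepsilon}(j)}{p(j)}\right) - \ln\left(1 + \frac{\boldsymbol{\varepsilon}(j)}{p(j)}\right) .$$

We use the second order Taylor expansion $\ln(1 \pm x) = \pm x - x^2/2 + O(|x|^3)$ around 0 to get that $\ln(1 - x) -
\ln(1 + x) = -2x + r(x)$, where $r(x)$ is a remainder upper bounded for all $|x| \le 1/2$ as $|r(x)| \le c'|x|^3$ with some
universal constant $c' > 0$. Substituting

$$\begin{aligned}
D(p - \boldsymbol{\varepsilon} \,\|\, p + \boldsymbol{\varepsilon}) &= \sum_{j=1}^{M} (p(j) - \boldsymbol{\varepsilon}(j)) \left[ -2\,\frac{\boldsymbol{\varepsilon}(j)}{p(j)} + r\!\left(\frac{\boldsymbol{\varepsilon}(j)}{p(j)}\right) \right] \\
&= -2\sum_{j=1}^{M} \boldsymbol{\varepsilon}(j) + 2\sum_{j=1}^{M} \frac{\boldsymbol{\varepsilon}^2(j)}{p(j)} + \sum_{j=1}^{M} (p(j) - \boldsymbol{\varepsilon}(j))\, r\!\left(\frac{\boldsymbol{\varepsilon}(j)}{p(j)}\right) .
\end{aligned}$$

Here the first term is 0. Letting $\underline{p} = \min_{j \in \underline{M}} p(j)$, the second term is bounded by $2\sum_{j=1}^{M} \boldsymbol{\varepsilon}^2(j)/\underline{p} = (2/\underline{p})\|\boldsymbol{\varepsilon}\|^2$
and the third term is bounded by

$$\begin{aligned}
\sum_{j=1}^{M} (p(j) - \boldsymbol{\varepsilon}(j)) \left| r\!\left(\frac{\boldsymbol{\varepsilon}(j)}{p(j)}\right) \right|
&\le c' \sum_{j=1}^{M} (p(j) - \boldsymbol{\varepsilon}(j))\,\frac{|\boldsymbol{\varepsilon}(j)|^3}{p^3(j)} \\
&\le c' \sum_{j=1}^{M} \left( \frac{|\boldsymbol{\varepsilon}(j)|}{p(j)} + \frac{\boldsymbol{\varepsilon}^2(j)}{p^2(j)} \right) \frac{\boldsymbol{\varepsilon}^2(j)}{p(j)} \\
&\le c' \left( \frac{1}{2} + \frac{1}{4} \right) \sum_{j=1}^{M} \frac{\boldsymbol{\varepsilon}^2(j)}{p(j)}
\le \frac{3c'}{4\underline{p}}\,\|\boldsymbol{\varepsilon}\|^2 .
\end{aligned}$$

Hence, $D(p - \boldsymbol{\varepsilon} \,\|\, p + \boldsymbol{\varepsilon}) \le \frac{8 + 3c'}{4\underline{p}}\,\|\boldsymbol{\varepsilon}\|^2 = c\,\|\boldsymbol{\varepsilon}\|^2$ for $c = \frac{8 + 3c'}{4\underline{p}}$. □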
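The bound of Lemma 16 is easy to sanity-check numerically. The Python sketch below is an illustration with hypothetical function names; it uses the explicit constant $c = (8 + 3c')/(4\underline{p})$ from the proof together with $c' = 8\ln(3/e)$, the value mentioned in the accompanying footnote.

```python
import math

C_PRIME = 8.0 * math.log(3.0 / math.e)   # remainder constant c' from the footnote

def kl(p, q):
    """KL divergence D(p || q) for two probability vectors given as lists."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def lemma16_constant(p):
    """c = (8 + 3c') / (4 * min_j p(j)), as in the proof of Lemma 16."""
    return (8.0 + 3.0 * C_PRIME) / (4.0 * min(p))

def kl_and_bound(p, eps):
    """Return (D(p - eps || p + eps), c * ||eps||^2) for a perturbation eps.

    Assumes the coordinates of eps sum to zero and |eps[j]| <= p[j] / 2.
    """
    lo = [a - e for a, e in zip(p, eps)]
    hi = [a + e for a, e in zip(p, eps)]
    lhs = kl(lo, hi)
    rhs = lemma16_constant(p) * sum(e * e for e in eps)
    return lhs, rhs
```

For example, `kl_and_bound([0.5, 0.3, 0.2], [0.05, -0.02, -0.03])` returns a pair whose first entry does not exceed the second, as the lemma predicts.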



Proof of Theorem 13. The proof is similar to that of Auer et al. [5]. When M = 1, G is always trivial, thus we
assume that $M \ge 2$. Without loss of generality we may assume that all the actions are all-revealing. Then,
as in Section 3 for M = 2, we can also assume that there are no duplicate actions, thus for any two actions i
and $i'$, $\ell_i \ne \ell_{i'}$.

Lemma 14 implies that there exist two actions $i_1, i_2$, $p \in \Delta_M$, $v \in \mathbb{R}^M$, and $c_1, \alpha > 0$ satisfying conditions
(a)-(f). To avoid cumbersome indexing, by renaming the actions we can achieve that $i_1 = 1$ and $i_2 = 2$. Let
$p_1 = p + \epsilon v$ and $p_2 = p - \epsilon v$ for some $\epsilon \in (0, \alpha)$. We determine the precise value of $\epsilon$ later. By Lemma 14
(d), $p_1, p_2 \in \Delta_M$.

Fix any randomized learning algorithm A and time horizon T. We use randomization replacing the
outcomes by a sequence $J_1, J_2, \ldots, J_T$ of random variables i.i.d. according to $p_k$, $k \in \{1, 2\}$, and independently
of the internal randomization of A. Let

$$N_i^{(k)} = N_i^{(k)}(A, T) = \sum_{t=1}^{T} \Pr\nolimits_k[I_t = i] \in [0, T] \qquad (8)$$

be the expected number of times action i is chosen by A under $p_k$ up to time step T. With subindex k, $\Pr_k$
and $\mathbb{E}_k$ denote probability and expectation given outcome model $k \in \{1, 2\}$, respectively.

Lemma 17. For any partial-monitoring game with N actions and M outcomes, algorithm A and outcome
distribution $p_k \in \Delta_M$ such that action k is optimal under $p_k$, we have

$$R_T(A, G) \ge \sum_{i \in \underline{N}} N_i^{(k)} \langle \ell_i - \ell_k, p_k \rangle, \qquad k = 1, 2 . \qquad (9)$$

Footnote: In fact, one can take $c' = 8\ln(3/e) \approx 0.79$.
The proof is in the Appendix.

Parts (e) and (f) of Lemma 14 imply that $\langle \ell_k, p_k \rangle \le \langle \ell_i, p_k \rangle$ for $k \in \{1, 2\}$ and any $i \in \underline{N}$, hence $R_T(A, G)$
can be bounded in terms of $N_i^{(k)}$ using Lemma 17. They also imply that for any $i \in \underline{N}$ if $\ell_i \ne \ell_k$ then
$\langle \ell_i - \ell_k, p_k \rangle \ge c_1\epsilon$. Therefore, we can continue lower bounding (9) as

$$\sum_{i \in \underline{N},\, i \ne k} N_i^{(k)} \langle \ell_i - \ell_k, p_k \rangle \ge \sum_{i \in \underline{N},\, i \ne k} N_i^{(k)} c_1 \epsilon = c_1 \left(T - N_k^{(k)}\right) \epsilon . \qquad (10)$$

Collecting (9) and (10), we see that the worst-case regret of A is lower bounded by

$$R_T(A, G) \ge c_1 \left(T - N_k^{(k)}\right) \epsilon \qquad (11)$$

for $k \in \{1, 2\}$. Averaging (11) over $k \in \{1, 2\}$ we get

$$R_T(A, G) \ge c_1 \left(2T - N_1^{(1)} - N_2^{(2)}\right) \epsilon / 2 . \qquad (12)$$

We now focus on lower bounding $2T - N_1^{(1)} - N_2^{(2)}$. We start by showing that $N_2^{(2)}$ is close to $N_2^{(1)}$. The
following lemma, which is the key lemma of both lower bound proofs, carries that out formally and states
that the expected number of times an action is played by A does not change too much when we change the
model, if the outcome distributions $p_1$ and $p_2$ are "close" in KL-divergence:

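Lemma 18. For any partial-monitoring game with N actions and M outcomes, algorithm A, pair of
outcome distributions $p_1, p_2 \in \Delta_M$ and action i, we have

$$N_i^{(2)} - N_i^{(1)} \le T\sqrt{D(p_2 \,\|\, p_1)\, N_{\mathrm{rev}}^{(2)} / 2} \quad \text{and} \quad N_i^{(1)} - N_i^{(2)} \le T\sqrt{D(p_1 \,\|\, p_2)\, N_{\mathrm{rev}}^{(1)} / 2} ,$$

where $N_{\mathrm{rev}}^{(k)} = \sum_{t=1}^{T} \Pr_k[I_t \in \mathcal{R}] = \sum_{i \in \mathcal{R}} N_i^{(k)}$ under model $p_k$, $k = 1, 2$, with $\mathcal{R}$ being the set of revealing
actions.

The proof is in the Appendix.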



We use Lemma 18 for i = 2 and that $N_{\mathrm{rev}}^{(2)} \le T$ to bound the difference $N_2^{(2)} - N_2^{(1)}$ as

$$N_2^{(2)} - N_2^{(1)} \le T\sqrt{D(p_2 \,\|\, p_1)\, T / 2} = T^{3/2}\sqrt{D(p_2 \,\|\, p_1)/2} . \qquad (13)$$

We upper bound $D(p_2 \,\|\, p_1)$ using Lemma 16 with $\boldsymbol{\varepsilon} = \epsilon v$. The lemma implies that $D(p_2 \,\|\, p_1) \le c_2\epsilon^2$ for
$\epsilon < \epsilon_0$ with some $\epsilon_0, c_2 > 0$ which depend only on v and p. Putting this together with (13) we get

$$N_2^{(2)} \le N_2^{(1)} + c_3\,\epsilon\, T^{3/2} ,$$

where $c_3 = \sqrt{c_2/2}$. Together with $N_1^{(1)} + N_2^{(1)} \le T$ we get

$$2T - N_1^{(1)} - N_2^{(2)} \ge 2T - N_1^{(1)} - N_2^{(1)} - c_3\,\epsilon\, T^{3/2} \ge T - c_3\,\epsilon\, T^{3/2} .$$

Substituting into (12) and choosing $\epsilon = 1/(2c_3 T^{1/2})$ gives the desired lower bound

$$R_T(A, G) \ge \frac{c_1}{8c_3}\sqrt{T}$$

provided that our choice of $\epsilon$ ensures that $\epsilon < \min(\alpha, \epsilon_0) =: \epsilon_1$ that depends only on L. This condition is
satisfied for all $T > T_0 = 1/(2c_3\epsilon_1)^2$. Since $c_1$, $c_3$, and $\epsilon_1$ depend only on L, for such T, $R_T(G) \ge \frac{c_1}{8c_3}\sqrt{T}$.

Footnote: It seems from the proof that $N_{\mathrm{rev}}^{(k)}$ could be slightly sharpened to $N_{\mathrm{rev}}^{(k)} - \Pr_k[I_T \in \mathcal{R}]$.

The non-triviality of the game implies that Lemma 3 d) does not hold, so neither does b), that is,
$R_T(G) > 0$ for $T \ge 1$. Thus choosing

$$c = \min\left( \min_{1 \le T \le T_0} \frac{R_T(G)}{\sqrt{T}},\ \frac{c_1}{8c_3} \right) ,$$

$c > 0$ and for any T, $R_T(G) \ge c\sqrt{T}$. □
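The choice of $\epsilon$ in the proof can be verified with a few lines of arithmetic. The sketch below is an illustration only (function names are hypothetical): it evaluates the right-hand side of (12) after substituting $2T - N_1^{(1)} - N_2^{(2)} \ge T - c_3\epsilon T^{3/2}$, and confirms that $\epsilon = 1/(2c_3\sqrt{T})$ yields $\frac{c_1}{8c_3}\sqrt{T}$.

```python
import math

def regret_lower_bound(T, c1, c3, eps):
    """c1 * (T - c3 * eps * T^(3/2)) * eps / 2, the bound obtained from (12)."""
    return c1 * (T - c3 * eps * T ** 1.5) * eps / 2.0

def tuned_eps(T, c3):
    """The choice eps = 1 / (2 * c3 * sqrt(T)) made in the proof."""
    return 1.0 / (2.0 * c3 * math.sqrt(T))

def closed_form(T, c1, c3):
    """The resulting bound c1 * sqrt(T) / (8 * c3)."""
    return c1 * math.sqrt(T) / (8.0 * c3)
```

Since the bound is a concave quadratic in $\epsilon$, this value of $\epsilon$ is in fact its maximizer, so no other choice would improve the constant in this argument.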



Remark. Theorem 13 also holds if $M = \infty$. Namely, since the proof of c)⇒d) of Lemma 3 remains
obviously valid, the non-triviality of the game ($K \ge 2$) excludes that c) holds, and thus for each $i \in \underline{N}$ there
is $j_i \in \{1, 2, \ldots\}$ such that $\ell_{i,j_i}$ is not minimal in the $j_i$-th column of L. Then take the minor of L consisting
of its (at most N) columns corresponding to $O = \{j_1, \ldots, j_N\}$. For the corresponding finite game $G_O$ (that
does not depend on A), Lemma 3 c) still does not hold, thus nor does d), and $G_O$ is also non-trivial. Hence
Theorem 13 implies that

$$R_T(G) = \inf_A \sup_{j_{1:T} \in \{1,2,\ldots\}^T} R_T(A, G) \ge \inf_A \sup_{j_{1:T} \in O^T} R_T(A, G) = R_T(G_O) = \Omega(\sqrt{T}) .$$

7. Lower Bound for Hard Games 

In this section, we present an $\Omega(T^{2/3})$ lower bound for the expected regret of any two-outcome game in
the case when the separation condition does not hold.

Theorem 19 (Lower bound for hard games). If M = 2 and G = (L, H) satisfies the non-degeneracy
condition and the separation condition does not hold then there exists a constant $C > 0$ such that for any
$T \ge 1$ the minimax expected regret $R_T(G) \ge C\,T^{2/3}$.

Proof of Theorem 19. We follow the lower bound proof for label efficient prediction from Cesa-Bianchi
et al. [22] with a few changes. The most important change, as we will see, is the choice of the models we
randomize over.

As the first step, the following lemma shows that non-revealing degenerate actions do not influence the
minimax regret of a game.

Lemma 20. Let G be a non-degenerate game with two outcomes. Let $G'$ be the game we get by removing
the degenerate non-revealing actions from G. Then $R_T(G) = R_T(G')$.

The proof of this lemma can be found in the Appendix.

By the non-degeneracy condition and Lemma 20 we can assume without loss of generality that G does
not have degenerate actions. We can also assume without loss of generality that actions 1 and 2 are the two
consecutive non-dominated non-revealing actions. It follows by scaling and a reduction similar to the one
we used in Section 5.1 that we can further assume $(\ell_{1,1}, \ell_{1,2}) = (0, \alpha)$, $(\ell_{2,1}, \ell_{2,2}) = (1 - \alpha, 0)$ with some
$\alpha \in (0, 1)$. Using the non-degeneracy condition and that actions 1 and 2 are consecutive non-dominated
actions, we get that for all $i \ge 3$, there exists some $\lambda_i \in \mathbb{R}$ depending only on L such that

$$\begin{aligned}
\ell_{i,1} &> \lambda_i \ell_{1,1} + (1 - \lambda_i)\ell_{2,1} = (1 - \lambda_i)(1 - \alpha) , \\
\ell_{i,2} &> \lambda_i \ell_{1,2} + (1 - \lambda_i)\ell_{2,2} = \lambda_i \alpha .
\end{aligned} \qquad (14)$$

Footnote: The same reasoning can be used to show that we could assume without loss of generality $M \le N$ in the proof of Theorem 13.

Let $\lambda_{\min} = \min_{i \ge 3} \lambda_i$, $\lambda_{\max} = \max_{i \ge 3} \lambda_i$, and $\lambda^* = \lambda_{\max} - \lambda_{\min}$.

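We define two models for generating outcomes from {1, 2}. In model 1, the outcome distribution is
$p_1(1) = \alpha + \epsilon$, $p_1(2) = 1 - p_1(1)$, whereas in model 2, $p_2(1) = \alpha - \epsilon$, $p_2(2) = 1 - p_2(1)$ with
$0 < \epsilon \le \min(\alpha, 1 - \alpha)/2$ to be chosen later. We use randomization replacing the outcomes by a sequence $J_1, J_2, \ldots, J_T$ of
random variables i.i.d. according to $p_k$, $k \in \{1, 2\}$, and independently of the internal randomization of A.
Let $N_i^{(k)}$ be the expected number of times action i is chosen by A under $p_k$ up to time step T, as in (8). With
subindex k, $\Pr_k$ and $\mathbb{E}_k$ denote probability and expectation given outcome model $k \in \{1, 2\}$, respectively.
Finally, let $N_{\ge 3}^{(k)} = \sum_{i \ge 3} N_i^{(k)}$. Note that, if $\epsilon < \epsilon_0$ with some $\epsilon_0$ depending only on L then only actions 1 and
2 can be optimal for these models. Namely, action k is optimal under $p_k$, hence $R_T(A, G)$ can be bounded
in terms of $N_i^{(k)}$ using Lemma 17:

$$R_T(A, G) \ge \sum_{i \in \underline{N},\, i \ne k} N_i^{(k)} \langle \ell_i - \ell_k, p_k \rangle = \sum_{i=3}^{N} N_i^{(k)} \langle \ell_i - \ell_k, p_k \rangle + N_{3-k}^{(k)} \langle \ell_{3-k} - \ell_k, p_k \rangle \qquad (15)$$

for k = 1, 2. Now, by (14), there exists $\tau > 0$ depending only on L such that for all $i \ge 3$, $\ell_{i,1} \ge (1 - \lambda_i)(1 - \alpha) + \tau$
and $\ell_{i,2} \ge \alpha\lambda_i + \tau$. These bounds and simple algebra give that

$$\begin{aligned}
\langle \ell_i - \ell_1, p_1 \rangle &= (\ell_{i,1} - \ell_{1,1})(\alpha + \epsilon) + (\ell_{i,2} - \ell_{1,2})(1 - \alpha - \epsilon) \\
&\ge ((1 - \lambda_i)(1 - \alpha) + \tau)(\alpha + \epsilon) + (\alpha\lambda_i + \tau - \alpha)(1 - \alpha - \epsilon) \\
&= (1 - \lambda_i)\epsilon + \tau \\
&\ge (1 - \lambda_{\max})\epsilon + \tau =: f_1
\end{aligned}$$

and

$$\langle \ell_2 - \ell_1, p_1 \rangle = (1 - \alpha)(\alpha + \epsilon) - \alpha(1 - \alpha - \epsilon) = \epsilon .$$

Analogously, we get

$$\langle \ell_i - \ell_2, p_2 \rangle \ge \lambda_{\min}\epsilon + \tau =: f_2 \quad \text{and} \quad \langle \ell_1 - \ell_2, p_2 \rangle = \epsilon .$$

Note that if $\epsilon < \tau / \max(|1 - \lambda_{\max}|, |\lambda_{\min}|)$ then both $f_1$ and $f_2$ are positive. Substituting these into (15) gives

$$R_T(A, G) \ge f_k N_{\ge 3}^{(k)} + \epsilon\, N_{3-k}^{(k)} . \qquad (16)$$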
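The "simple algebra" above can be double-checked numerically. The following Python sketch is an illustration with hypothetical names: it builds the normalized loss vectors $\ell_1 = (0, \alpha)$ and $\ell_2 = (1 - \alpha, 0)$, a third action sitting exactly $\tau$ above the line through $\ell_1$ and $\ell_2$ (the boundary case of the bounds derived from (14)), and evaluates the four inner products under the two models.

```python
def inner(u, p):
    """Inner product of two equal-length tuples."""
    return sum(a * b for a, b in zip(u, p))

def gaps(alpha, eps, lam, tau):
    """Loss gaps under the two models of Section 7 (boundary case, sketch).

    Returns (<l2 - l1, p1>, <l1 - l2, p2>, <li - l1, p1>, <li - l2, p2>).
    """
    l1 = (0.0, alpha)
    l2 = (1.0 - alpha, 0.0)
    li = ((1.0 - lam) * (1.0 - alpha) + tau, alpha * lam + tau)
    p1 = (alpha + eps, 1.0 - alpha - eps)
    p2 = (alpha - eps, 1.0 - alpha + eps)
    sub = lambda a, b: tuple(x - y for x, y in zip(a, b))
    return (inner(sub(l2, l1), p1),
            inner(sub(l1, l2), p2),
            inner(sub(li, l1), p1),
            inner(sub(li, l2), p2))
```

In the boundary case the four values are $\epsilon$, $\epsilon$, $(1 - \lambda_i)\epsilon + \tau$ and $\lambda_i\epsilon + \tau$, matching the displayed identities; for an action with larger losses the last two become inequalities.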



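The following lemma is an application of Lemmas 18 and 16.

Lemma 21. There exists a constant $c > 0$ (depending on $\alpha$ only) such that

$$N_2^{(1)} \ge N_2^{(2)} - cT\epsilon\sqrt{N_{\ge 3}^{(2)}} \quad \text{and} \quad N_1^{(2)} \ge N_1^{(1)} - cT\epsilon\sqrt{N_{\ge 3}^{(1)}} .$$

Proof. We only prove the first inequality, the other one is symmetric. Using Lemma 18 with M = 2, i = 2
and the fact that actions 1 and 2 are non-revealing, we have

$$N_2^{(2)} - N_2^{(1)} \le T\sqrt{D(p_2 \,\|\, p_1)\, N_{\ge 3}^{(2)} / 2} .$$

Lemma 16 with M = 2, $p = (\alpha, 1 - \alpha)^\top$, and $\boldsymbol{\varepsilon} = (\epsilon, -\epsilon)^\top$ gives $D(p_2 \,\|\, p_1) \le \hat{c}\,\epsilon^2$, where $\hat{c}$ depends only on
$\alpha$. Rearranging and substituting $c = \sqrt{\hat{c}/2}$ yields the first statement of the lemma. □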
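Let $l = \arg\min_{k \in \{1,2\}} N_{\ge 3}^{(k)}$. Now, for $k \ne l$ we can lower bound the regret using Lemma 21
for (16):

$$R_T(A, G) \ge f_k N_{\ge 3}^{(k)} + \epsilon\left( N_{3-k}^{(l)} - cT\epsilon\sqrt{N_{\ge 3}^{(l)}} \right) \ge f_k N_{\ge 3}^{(l)} + \epsilon\left( N_{3-k}^{(l)} - cT\epsilon\sqrt{N_{\ge 3}^{(l)}} \right) , \qquad (17)$$

as $f_k > 0$. For k = l we do this subtracting $cT\epsilon^2\sqrt{N_{\ge 3}^{(l)}} \ge 0$ from the right-hand side of (16) leading to the
same lower bound, hence (17) holds for k = 1, 2. Finally, averaging (17) over $k \in \{1, 2\}$ we have the bound

$$\begin{aligned}
R_T(A, G) &\ge \frac{f_1 + f_2}{2}\, N_{\ge 3}^{(l)} + \epsilon\left( \frac{N_1^{(l)} + N_2^{(l)}}{2} - cT\epsilon\sqrt{N_{\ge 3}^{(l)}} \right) \\
&= \left( \frac{(1 - \lambda_{\max} + \lambda_{\min})\epsilon}{2} + \tau \right) N_{\ge 3}^{(l)} + \epsilon\left( \frac{T - N_{\ge 3}^{(l)}}{2} - cT\epsilon\sqrt{N_{\ge 3}^{(l)}} \right) \\
&= \left( \tau - \frac{\lambda^*\epsilon}{2} \right) N_{\ge 3}^{(l)} + \epsilon\left( \frac{T}{2} - cT\epsilon\sqrt{N_{\ge 3}^{(l)}} \right) .
\end{aligned}$$

Choosing $\epsilon = c_2 T^{-1/3}$ ($\le c_2$) with $c_2 > 0$ gives

$$\begin{aligned}
R_T(A, G) &\ge T^{2/3}\left( \left( \tau - \frac{\lambda^* c_2 T^{-1/3}}{2} \right) x^2 + \frac{c_2}{2} - c\,c_2^2\,x \right) \\
&\ge T^{2/3}\left( \left( \tau - \frac{\lambda^* c_2}{2} \right) x^2 - c\,c_2^2\,x + \frac{c_2}{2} \right) = T^{2/3}\, q(x) ,
\end{aligned}$$

where $x = T^{-1/3}\sqrt{N_{\ge 3}^{(l)}}$ and $q(x)$ can be written and lower bounded as

$$q(x) = \left( \tau - \frac{\lambda^* c_2}{2} \right)\left( x - \frac{c\,c_2^2}{2\tau - \lambda^* c_2} \right)^2 + \frac{c_2}{2} - \frac{c^2 c_2^4}{4\tau - 2\lambda^* c_2} \ge \frac{c_2}{2} - \frac{c^2 c_2^4}{2(2\tau - \lambda^* c_2)}$$

independently of x whenever $\lambda^* c_2 < 2\tau$ and $c_2 \le 1$. Now it is easy to see that if $c_2 = \min(\tau/(c^2 + \lambda^*), 1)$
then these hold, moreover, $q(x) \ge c_2/4 > 0$, giving the desired lower bound

$$R_T(A, G) \ge \frac{c_2}{4}\, T^{2/3}$$

provided that our choice of $\epsilon$ ensures that $\epsilon < \min(\alpha/2, (1 - \alpha)/2, \epsilon_0, \tau/|1 - \lambda_{\max}|, \tau/|\lambda_{\min}|) =: \epsilon_1$ that depends
only on L. This condition is satisfied for all $T > T_0 = (c_2/\epsilon_1)^3$. Since $c_2$ and $\epsilon_1$ depend only on L, for such
T, $R_T(G) \ge \frac{c_2}{4}\, T^{2/3}$.

If the separation condition does not hold then the game is clearly non-trivial which, using Lemma 3 b)
and d) as in the proof of Theorem 13, implies that $R_T(G) > 0$ for $T \ge 1$. Thus choosing

$$C = \min\left( \min_{1 \le T \le T_0} \frac{R_T(G)}{T^{2/3}},\ \frac{c_2}{4} \right) ,$$

$C > 0$ and for any T, $R_T(G) \ge C\,T^{2/3}$. □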
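The final step of the proof rests on the quadratic $q(x) = (\tau - \lambda^* c_2/2)x^2 - c\,c_2^2\,x + c_2/2$ staying above $c_2/4$ for the stated choice of $c_2$. The Python sketch below checks this on a grid; it is an illustration under our reading of the derivation, and the function names are hypothetical.

```python
def c2_choice(tau, c, lam_star):
    """c2 = min(tau / (c^2 + lambda*), 1), the choice made in the proof."""
    return min(tau / (c * c + lam_star), 1.0)

def q(x, tau, c, lam_star, c2):
    """q(x) = (tau - lambda* c2 / 2) x^2 - c c2^2 x + c2 / 2."""
    return (tau - lam_star * c2 / 2.0) * x * x - c * c2 * c2 * x + c2 / 2.0

def q_stays_above_quarter(tau, c, lam_star, grid_max=10.0, steps=10000):
    """Check q(x) >= c2 / 4 on a uniform grid of x values in [0, grid_max]."""
    c2 = c2_choice(tau, c, lam_star)
    return all(
        q(grid_max * k / steps, tau, c, lam_star, c2) >= c2 / 4.0
        for k in range(steps + 1)
    )
```

With this choice the leading coefficient $\tau - \lambda^* c_2/2$ is positive, so $q$ is a convex parabola and its minimum is the completed-square value computed in the proof.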
8. Discussion

In this paper we classified non-degenerate partial-monitoring games with two outcomes based on their
minimax regret. An immediate question is how the classification extends to degenerate games. Unfortunately,
the non-degeneracy condition is needed in both the upper and lower bound proofs. We do not even know
if all degenerate games fall into one of the four categories or there are some games with minimax regret
of $\Theta(T^{\alpha})$ for some $\alpha \in (1/2, 2/3)$. Nonetheless, we conjecture that, if the revealing degenerate actions are
included in the chain of non-dominated actions, the classification theorem holds without any change.

The most important open question is whether our results generalize to games with more outcomes. A
simple observation is that, given a finite partial-monitoring game, if we restrict the opponent's choices to
any two outcomes, the resulting game's hardness serves as a lower bound on the minimax regret of the
original game. This gives us a sufficient condition that a game has $\Omega(T^{2/3})$ minimax regret. We believe that
the $\Omega(T^{2/3})$ lower bound can also be generalized to situations where two "$\epsilon$-close" outcome distributions
are not distinguishable by playing only their respective optimal actions. Generalizing the upper bound result
seems more challenging. The algorithm AppleTree heavily exploits the two-dimensional structure of the
losses and, as of yet, in general we do not know how to construct an algorithm that achieves $\widetilde{O}(\sqrt{T})$ regret
on partial-monitoring games with more than two outcomes.

It is also important to note that our upper bound result heavily exploits the assumption that the opponent
is oblivious. Our results do not extend to games with non-oblivious opponents, to the best of our knowledge.



Appendix A. 

Proof of Lemma 3. a)⇒b) is obvious.
b)⇒c) For any A,

$$\begin{aligned}
R_T(A, G) &\ge \sup_{j \in \underline{M},\, J_1 = \cdots = J_T = j} \mathbb{E}\left[ \sum_{t=1}^{T} \ell_{I_t,J_t} - \min_{i \in \underline{N}} \sum_{t=1}^{T} \ell_{i,J_t} \right] \\
&= \sup_{j \in \underline{M}} \mathbb{E}\left[ \sum_{t=1}^{T} \ell_{I_t,j} - T\min_{i \in \underline{N}} \ell_{i,j} \right] \\
&\ge \sup_{j \in \underline{M}} \left( \mathbb{E}[\ell_{I_1,j}] - \min_{i \in \underline{N}} \ell_{i,j} \right) =: f(A) .
\end{aligned}$$

b) leads to

$$0 = R_T(G) = \inf_A R_T(A, G) \ge \inf_A f(A) .$$

Observe that $f(A)$ depends on A through only the distribution of $I_1$ on $\underline{N}$, denoted by $q = q(A)$ now, that
is, $f(A) = f'(q)$ for proper $f'$. This dependence is continuous on the compact domain of q, hence the
infimum can be replaced by minimum. Thus $\min_q f'(q) \le 0$, that is, there exists a q such that for all $j \in \underline{M}$,
$\mathbb{E}[\ell_{I_1,j}] = \min_{i \in \underline{N}} \ell_{i,j}$. This implies that the support of q contains only actions whose loss is not larger than
the loss of any other action irrespectively of the choice of Nature's action. (Such an action is obviously
non-dominated as shown by any $p \in \Delta_M$ supported on all outcomes.)

c)⇒d) Action i in c) is non-dominated, and any other action with loss vector distinct from $\ell_i$ is dominated
(by i and any action with loss vector $\ell_i$).

d)⇒a) For any action $i \in \underline{N}$, as in the proof of Lemma 14 consider the compact convex cell $C_i$ in $\Delta_M$.
By Lemma 15, $\bigcup_{i \notin \mathcal{D}} C_i = \Delta_M$. This and d) imply that there is an i with $C_i = \Delta_M$, that is, i is optimal for any
outcome. So the algorithm that always plays i has zero regret for all outcome sequences and T. □
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Proof of Theorem^Case ( fid] ). We know that K >2 and G has no reveaUng action. Then for any A, 

' T T ' 

_f=l - t=l 



Rt{A,G)> sup E 

jeM,Ji=-=JT=j 



;=i Lf=i 



1 ^ 



E 



M 



— > min ^; ; 



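Here $I_t$ is a random variable usually depending on $J_{1:t-1}$, that is, on $j$ through the outcomes. However, since
G has no revealing action, now the distribution of $I_t$ is independent of $j$, thus $\mathbb{E}\big[\sum_{j=1}^{M} \ell_{I_t,j}\big] \ge \min_{i\in\underline{N}} \sum_{j=1}^{M} \ell_{i,j}$
for each $t$, and we have

$$
R_T(A,G) \ge \frac{T}{M}\Big(\min_{i\in\underline{N}} \sum_{j=1}^{M} \ell_{i,j} - \sum_{j=1}^{M} \min_{i\in\underline{N}} \ell_{i,j}\Big) = cT ,
$$

where $c > 0$ if $K \ge 2$ (because $c \ge 0$, and $c = 0$ would imply Lemma 3 c), thus also d)). Since $c$ depends
only on L, $R_T(G) \ge cT = \Theta(T)$. □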
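
The constant in the linear lower bound can be computed directly from the loss matrix. The following is a minimal sketch, assuming the loss matrix is given as a list of integer rows (one row per action, one column per outcome); the function name `linear_regret_constant` and the two example matrices are hypothetical. The sketch ignores the feedback matrix, so the value is only meaningful as a regret rate when no action is revealing.

```python
from fractions import Fraction


def linear_regret_constant(loss):
    """Return c = (1/M) * (min_i sum_j loss[i][j] - sum_j min_i loss[i][j]).

    `loss` is a list of rows with integer entries; rows are actions, columns are outcomes.
    """
    n_outcomes = len(loss[0])
    # Best total loss of a single fixed action, summed over all outcomes.
    best_fixed_total = min(sum(row) for row in loss)
    # Total loss when the best action is picked separately for every outcome.
    per_outcome_best_total = sum(
        min(row[j] for row in loss) for j in range(n_outcomes)
    )
    return Fraction(best_fixed_total - per_outcome_best_total, n_outcomes)


# Hypothetical example: each action is optimal for exactly one outcome.
matching = [[0, 1], [1, 0]]
# Hypothetical example: action 0 is optimal for every outcome (trivial game).
trivial = [[0, 0], [1, 2]]

print(linear_regret_constant(matching))  # 1/2
print(linear_regret_constant(trivial))   # 0
```

The second example illustrates the remark in the proof: $c = 0$ exactly when some action is optimal for every outcome.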

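Proof of Lemma 15. By Definition 1, action $i$ is dominated if and only if $C_i \subseteq \bigcup_{i':\ell_{i'}\ne\ell_i} C_{i'}$.

$C_i \subseteq \bigcup_{i':\ell_{i'}\ne\ell_i} C_{i'} \Rightarrow \operatorname{int} C_i = \emptyset$: Since $\ell_{i'} \ne \ell_i$, each intersection $C_i \cap C_{i'}$ is contained in the set $\{p \in \Delta_M : \langle \ell_i - \ell_{i'}, p\rangle = 0\}$, which is closed and has empty interior in $\Delta_M$; hence so does the finite union of these intersections, which is $C_i$.

$\operatorname{int} C_i = \emptyset \Rightarrow \lambda(C_i) = 0$: Follows from convexity of $C_i$.

$\lambda(C_i) = 0 \Rightarrow C_i \subseteq \bigcup_{i':\ell_{i'}\ne\ell_i} C_{i'}$: Indirect proof: if some $p \in C_i$ is in the complement of $\bigcup_{i':\ell_{i'}\ne\ell_i} C_{i'}$, which is open in
$\Delta_M$, then there is a neighborhood $S$ of $p$ in $\Delta_M$ disjoint from $\bigcup_{i':\ell_{i'}\ne\ell_i} C_{i'}$. Thus $S \subseteq \bigcup_{i':\ell_{i'}=\ell_i} C_{i'} = C_i$ due
to (4), and $\lambda(C_i) \ge \lambda(S) > 0$, a contradiction.

Since $\lambda\big(\bigcup_{i\in\mathcal{D}} C_i\big) \le \sum_{i\in\mathcal{D}} \lambda(C_i) = 0$, from (4) we get $\lambda\big(\bigcup_{i\notin\mathcal{D}} C_i\big) \ge \lambda(\Delta_M)$, and $\lambda\big(\Delta_M \setminus \bigcup_{i\notin\mathcal{D}} C_i\big) = 0$. The
latter set is open in $\Delta_M$, so it must be empty, that is, $\bigcup_{i\notin\mathcal{D}} C_i = \Delta_M$. □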



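Proof of Lemma 17. Clearly, the worst-case expected regret of A is at least its average regret:

$$
R_T(A,G) = \sup_{j_{1:T}\in\underline{M}^T} \mathbb{E}\Big[\sum_{t=1}^{T} \ell_{I_t,j_t} - \min_{i\in\underline{N}} \sum_{t=1}^{T} \ell_{i,j_t}\Big]
\ge \mathbb{E}_k\Big[\sum_{t=1}^{T} \ell_{I_t,J_t} - \min_{i\in\underline{N}} \sum_{t=1}^{T} \ell_{i,J_t}\Big] ,
$$

where the expectation on the right-hand side is taken with respect to both the random choices of the outcomes
and the internal randomization of A. We lower bound the right-hand side, switching expectation and
minimum, to get

$$
\begin{aligned}
\mathbb{E}_k\Big[\sum_{t=1}^{T} \ell_{I_t,J_t} - \min_{i\in\underline{N}} \sum_{t=1}^{T} \ell_{i,J_t}\Big]
&\ge \mathbb{E}_k\Big[\sum_{t=1}^{T} \ell_{I_t,J_t}\Big] - \min_{i\in\underline{N}} \mathbb{E}_k\Big[\sum_{t=1}^{T} \ell_{i,J_t}\Big] \\
&= \sum_{t=1}^{T}\sum_{i=1}^{N} \mathbb{E}_k\big[\mathbb{I}(I_t = i)\,\ell_{i,J_t}\big] - \min_{i\in\underline{N}} \sum_{t=1}^{T} \langle \ell_i, p_k\rangle \\
&= \sum_{t=1}^{T}\sum_{i=1}^{N} \mathbb{E}_k\big[\mathbb{I}(I_t = i)\big]\,\mathbb{E}_k\big[\ell_{i,J_t}\big] - T \min_{i\in\underline{N}} \langle \ell_i, p_k\rangle
\qquad \text{(by the independence of } I_t \text{ and } J_t\text{)} \\
&= \sum_{i=1}^{N} \langle \ell_i, p_k\rangle \sum_{t=1}^{T} \Pr\nolimits_k[I_t = i] - T \min_{i\in\underline{N}} \langle \ell_i, p_k\rangle \\
&= \sum_{i=1}^{N} N_i^{(k)} \langle \ell_i, p_k\rangle - T \langle \ell_k, p_k\rangle \qquad\qquad \text{(A.1)} \\
&= \sum_{i\in\underline{N},\ i\ne k} N_i^{(k)} \langle \ell_i - \ell_k, p_k\rangle .
\end{aligned}
$$

(A.1) follows from the fact that action $k$ is optimal under $p_k$. Clearly the term $i = k$ can be omitted in the
last equality. □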
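
The last two lines of the derivation can be checked numerically. The sketch below is only an illustration under stated assumptions: the loss matrix, the outcome distribution `p` and the expected play counts `counts` (standing in for $N_i^{(k)}$) are hypothetical, and exact rational arithmetic is used so that the two sides can be compared for equality.

```python
from fractions import Fraction as F


def inner(u, p):
    """Inner product <u, p> of a loss vector and an outcome distribution."""
    return sum(a * b for a, b in zip(u, p))


def regret_decomposition(loss, p, counts):
    """Return both sides of the last equality in the proof of Lemma 17.

    loss[i] is the loss vector of action i, p the outcome distribution,
    counts[i] the expected number of times action i is played (sums to T).
    """
    expected = [inner(row, p) for row in loss]
    # k is an action that is optimal under p.
    k = min(range(len(loss)), key=lambda i: expected[i])
    horizon = sum(counts)
    # sum_i N_i <l_i, p> - T <l_k, p>
    lhs = sum(n * e for n, e in zip(counts, expected)) - horizon * expected[k]
    # sum_{i != k} N_i <l_i - l_k, p>
    rhs = sum(
        n * (e - expected[k])
        for i, (n, e) in enumerate(zip(counts, expected))
        if i != k
    )
    return lhs, rhs


# Hypothetical three-action, two-outcome example.
loss = [[0, 1], [1, 0], [1, 1]]
p = [F(3, 4), F(1, 4)]
counts = [5, 3, 2]  # T = 10

print(regret_decomposition(loss, p, counts))  # (Fraction(3, 1), Fraction(3, 1))
```

The equality holds because the play counts sum to $T$, which is the only fact used in the final step.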



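Proof of Lemma 18. We only prove the first inequality, the other one is symmetric. Assume first that A
is deterministic, that is, $I_t : \Sigma^{t-1} \to \underline{N}$, and so $I_t(h_{1:t-1})$ denotes the choice of the algorithm at time
step $t$, given that the (random) history of observations of length $t-1$, $H_{1:t-1} = (H_1, \dots, H_{t-1})$, takes the value
$h_{1:t-1} = (h_1, \dots, h_{t-1}) \in \Sigma^{t-1}$. (Note that this is a slightly different history definition than $\mathcal{H}_{1:t-1}$ defined in
Section 5.1, as $H_{1:t-1}$ does not include the actions since their choices are determined by the feedback anyway.
In general, $\mathcal{H}_{1:t-1}$ is equivalent to $H_{1:t-1} \cup (I_1, \dots, I_{t-1})$. Nevertheless, if it is assumed that the feedback
symbol sets of actions are disjoint then $H_{1:t-1}$ and $\mathcal{H}_{1:t-1}$ are equivalent.) We denote by $p_k^*$ the joint distribution
of $H_{1:T-1}$ over $\Sigma^{T-1}$ associated with $p_k$. (For games with only all-revealing actions, assuming $h_{i,j} = j$
in H, $p_k^*$ is the product distribution over the outcome sequences, that is, formally, $p_k^*(j_{1:T-1}) = \prod_{t=1}^{T-1} p_k(j_t)$.)
We can bound the difference $N_i^{(2)} - N_i^{(1)}$ as

$$
\begin{aligned}
N_i^{(2)} - N_i^{(1)}
&= \sum_{t=1}^{T} \big(\Pr\nolimits_2[I_t = i] - \Pr\nolimits_1[I_t = i]\big) \\
&= \sum_{t=1}^{T} \sum_{h_{1:T-1}\in\Sigma^{T-1}} \mathbb{I}\big(I_t(h_{1:t-1}) = i\big)\big(p_2^*(h_{1:T-1}) - p_1^*(h_{1:T-1})\big) \\
&= \sum_{h_{1:T-1}\in\Sigma^{T-1}} \big(p_2^*(h_{1:T-1}) - p_1^*(h_{1:T-1})\big) \sum_{t=1}^{T} \mathbb{I}\big(I_t(h_{1:t-1}) = i\big) \\
&\le T \sum_{h_{1:T-1}:\ p_2^*(h_{1:T-1}) \ge p_1^*(h_{1:T-1})} \big(p_2^*(h_{1:T-1}) - p_1^*(h_{1:T-1})\big) \qquad\qquad \text{(A.2)} \\
&= \frac{T}{2}\,\|p_2^* - p_1^*\|_1 \\
&\le T \sqrt{D(p_2^* \,\|\, p_1^*)/2} ,
\end{aligned}
$$

where the last step is an application of Pinsker's inequality [24, Lemma 12.6.1] to distributions $p_2^*$ and $p_1^*$.
Using the chain rule for KL divergence [24, Theorem 2.5.3] we can write (with somewhat sloppy notation)

$$
\begin{aligned}
D(p_2^* \,\|\, p_1^*)
&= \sum_{t=1}^{T-1} D\big(p_2^*(h_t \mid h_{1:t-1}) \,\big\|\, p_1^*(h_t \mid h_{1:t-1})\big) \\
&= \sum_{t=1}^{T-1} \sum_{h_{1:t-1}} \Pr\nolimits_2(H_{1:t-1} = h_{1:t-1}) \sum_{h_t} \Pr\nolimits_2(H_t = h_t \mid H_{1:t-1} = h_{1:t-1}) \ln \frac{\Pr\nolimits_2(H_t = h_t \mid H_{1:t-1} = h_{1:t-1})}{\Pr\nolimits_1(H_t = h_t \mid H_{1:t-1} = h_{1:t-1})} . \qquad \text{(A.3)}
\end{aligned}
$$

Decompose this sum for the cases $I_t(h_{1:t-1}) \notin \mathcal{R}$ and $I_t(h_{1:t-1}) \in \mathcal{R}$. In the first case, we play a non-revealing
action, thus our observation $H_t = h_{I_t(h_{1:t-1}),J_t}$ is a deterministic constant in both models 1 and 2,
thus both $\Pr_1(\cdot \mid H_{1:t-1} = h_{1:t-1})$ and $\Pr_2(\cdot \mid H_{1:t-1} = h_{1:t-1})$ are degenerate and the KL divergence factor is
0. Otherwise, playing a revealing action, $H_t = h_{I_t(h_{1:t-1}),J_t}$ is the same deterministic function of $J_t$ (which is
independent of $H_{1:t-1}$) in both models 1 and 2, and so the inner sum in (A.3) is

$$
\sum_{h_t\in\Sigma} \Pr\nolimits_2\big[h_{I_t(h_{1:t-1}),J_t} = h_t\big] \ln \frac{\Pr\nolimits_2\big[h_{I_t(h_{1:t-1}),J_t} = h_t\big]}{\Pr\nolimits_1\big[h_{I_t(h_{1:t-1}),J_t} = h_t\big]} . \qquad \text{(A.4)}
$$

Since $\Pr_k\big[h_{I_t(h_{1:t-1}),J_t} = h_t\big] = \sum_{j_t\in\underline{M}:\ h_{I_t(h_{1:t-1}),j_t} = h_t} p_k(j_t)$ ($k = 1,2$), using the log sum inequality ([24, Theorem
2.7.1]), (A.4) is upper bounded by

$$
\sum_{h_t\in\Sigma}\ \sum_{j_t\in\underline{M}:\ h_{I_t(h_{1:t-1}),j_t} = h_t} p_2(j_t) \ln \frac{p_2(j_t)}{p_1(j_t)}
= \sum_{j_t\in\underline{M}} p_2(j_t) \ln \frac{p_2(j_t)}{p_1(j_t)} = D(p_2 \,\|\, p_1) .
$$

Hence, $D(p_2^* \,\|\, p_1^*)$ is upper bounded by

$$
\sum_{t=1}^{T-1}\ \sum_{h_{1:t-1}\in\Sigma^{t-1}:\ I_t(h_{1:t-1})\in\mathcal{R}} \Pr\nolimits_2(H_{1:t-1} = h_{1:t-1})\, D(p_2 \,\|\, p_1)
= D(p_2 \,\|\, p_1) \sum_{t=1}^{T-1} \sum_{i\in\mathcal{R}} \Pr\nolimits_2[I_t = i]
= D(p_2 \,\|\, p_1)\, N_{\mathrm{rev}}^{(2,T-1)} ,
$$

where $N_{\mathrm{rev}}^{(k,T-1)} = \sum_{t=1}^{T-1} \Pr_k[I_t \in \mathcal{R}]$. This together with (A.2) gives $N_i^{(2)} - N_i^{(1)} \le T \sqrt{D(p_2 \,\|\, p_1)\, N_{\mathrm{rev}}^{(2,T-1)}/2}$.

If A is random and its internal random "bits" are represented by a random value $Z$ (which is independent
of $J_1, J_2, \dots$), then $N_i^{(k)} = \mathbb{E}\big[N_i^{(k)}(Z)\big]$ for $N_i^{(k)}(Z) = \sum_{t=1}^{T} \Pr_k[I_t = i \mid Z]$. Also let $N_{\mathrm{rev}}^{(k,T-1)}(Z) = \sum_{t=1}^{T-1} \Pr_k[I_t \in \mathcal{R} \mid Z]$.
The proof above implies that for any fixed $z \in \mathrm{Range}(Z)$,

$$
N_i^{(2)}(z) - N_i^{(1)}(z) \le T \sqrt{D(p_2 \,\|\, p_1)\, N_{\mathrm{rev}}^{(2,T-1)}(z)/2} ,
$$

and thus, using also Jensen's inequality,

$$
\begin{aligned}
N_i^{(2)} - N_i^{(1)} &= \mathbb{E}\big[N_i^{(2)}(Z) - N_i^{(1)}(Z)\big]
\le \mathbb{E}\Big[T \sqrt{D(p_2 \,\|\, p_1)\, N_{\mathrm{rev}}^{(2,T-1)}(Z)/2}\Big] \\
&\le T \sqrt{D(p_2 \,\|\, p_1)\, \mathbb{E}\big[N_{\mathrm{rev}}^{(2,T-1)}(Z)\big]/2}
= T \sqrt{D(p_2 \,\|\, p_1)\, N_{\mathrm{rev}}^{(2,T-1)}/2} ,
\end{aligned}
$$

that is clearly upper bounded by $T \sqrt{D(p_2 \,\|\, p_1)\, N_{\mathrm{rev}}^{(2)}/2}$, yielding the statement of the lemma. □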
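
The two facts about a pair of distributions that drive the bound (the sum of positive differences equals half the $\ell_1$ distance, and Pinsker's inequality) can be illustrated on a small example. This is a minimal numerical sketch, assuming finite distributions given as lists of probabilities with the second distribution positive wherever the first is; the function names and the example distributions are hypothetical and are not part of the proof.

```python
import math


def kl_divergence(p, q):
    """KL divergence D(p || q) in nats; assumes q[x] > 0 wherever p[x] > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def positive_part_gap(p2, p1):
    """Sum of p2(x) - p1(x) over the points x where p2(x) >= p1(x)."""
    return sum(a - b for a, b in zip(p2, p1) if a >= b)


def half_l1_distance(p2, p1):
    """Half of the l1 distance between the two distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p2, p1))


# Hypothetical distributions over three observation histories.
p2 = [0.5, 0.3, 0.2]
p1 = [0.2, 0.3, 0.5]

gap = positive_part_gap(p2, p1)
pinsker_bound = math.sqrt(kl_divergence(p2, p1) / 2)

print(gap, half_l1_distance(p2, p1), pinsker_bound)
```

Multiplying each of the three printed quantities by $T$ gives, in order, the right-hand side of (A.2) and the two expressions that follow it in the chain.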



Proof of Lemma 20. We prove the lemma by showing that for every algorithm A on game G there exists an
algorithm A' on G' such that for any outcome sequence, $R_T(A', G') \le R_T(A, G)$ and vice versa. Recall that
the minimax regret of a game is

$$
R_T(G) = \inf_{A}\ \sup_{j_{1:T}\in\underline{M}^T} R_T(A,G) ,
$$



Classification of finite partial-mon. games (Saturday 12'^ January, 2013 @ 11:40) 







Figure A.10: Degenerate non-revealing actions on the chain. The loss vector of action 2 is a convex combination of those of actions
1 and 3. On the other hand, the loss vector of action 4 is component-wise lower bounded by that of action 3. (Legend: □ revealing non-dominated action; the second marker denotes a non-revealing degenerate action.)



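where

$$
R_T(A,G) = \mathbb{E}\Big[\sum_{t=1}^{T} \ell_{I_t,J_t} - \min_{i\in\underline{N}} \sum_{t=1}^{T} \ell_{i,J_t}\Big] .
$$

First we observe that the term $\mathbb{E}\big[\min_{i\in\underline{N}} \sum_{t=1}^{T} \ell_{i,J_t}\big]$ does not change by removing degenerate actions. Indeed,
by the definition of a degenerate action, if the minimum is given by a degenerate action then there exists a
non-degenerate action with the same cumulative loss. It follows that we only have to deal with the term
$\mathbb{E}\big[\sum_{t=1}^{T} \ell_{I_t,J_t}\big]$.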

1. Let A' be an algorithm on G'. We define the algorithm A on G by choosing the same actions as A' at 
every time step. Since the action set of G is a superset of that of G', this construction results in a well 
defined algorithm on G, and trivially has the same expected loss as A'. 

2. Let A be an algorithm on G. From the definition of degenerate actions, we know that for every
degenerate action $i$, there are two possibilities:

(a) There exists a non-degenerate action $i_1$ such that $\ell_i$ is component-wise lower bounded by $\ell_{i_1}$.

(b) There are two non-degenerate actions $i_1$ and $i_2$ such that $\ell_i$ is a convex combination of $\ell_{i_1}$ and
$\ell_{i_2}$, that is, $\ell_i = \alpha_i \ell_{i_1} + (1 - \alpha_i)\ell_{i_2}$ for some $\alpha_i \in (0, 1)$.

An illustration of these cases can be found in Figure A.10. We construct A' the following way. At
every time step $t$, if $I_t$ (the action that algorithm A would take) is non-degenerate then let $I'_t = I_t$.
If $I_t = i$ is a degenerate action of the first kind, let $I'_t$ be $i_1$. If $I_t = i$ is a degenerate action of
the second kind then let $I'_t$ be $i_1$ with probability $\alpha_i$ and $i_2$ with probability $1 - \alpha_i$. Recall that G is
non-degenerate, so $i$ has to be a non-revealing action. However, $i_1$ and/or $i_2$ might be revealing ones.
To handle this, A' is defined to map the observation sequence, before using it as the argument of $I_t$,
replacing the feedbacks corresponding to degenerate action $i$ by $h_{i,1} = h_{i,2}$. That is, intuitively, A'
"pretends" that the feedbacks at such time steps are irrelevant. It is clear that the expected loss of A'
in every time step is less than or equal to the expected loss of A, concluding the proof.

□ 
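
The substitution step of the construction can be sketched in a few lines. This is only an illustration under stated assumptions: the loss vectors and the `rule` table are hypothetical values chosen to mirror the two cases of Figure A.10 (action 2 is a mixture of actions 1 and 3, action 4 is component-wise lower bounded by action 3), the names `substitute` and `expected_loss` are invented for the example, and the feedback remapping described in the proof is not modeled.

```python
from fractions import Fraction


def substitute(action, rule):
    """Distribution over non-degenerate actions that A' plays in place of `action`."""
    if action not in rule:
        # Non-degenerate actions are played unchanged.
        return {action: Fraction(1)}
    kind = rule[action]
    if kind[0] == "dominated":
        # First kind: play the action whose loss vector is component-wise no larger.
        return {kind[1]: Fraction(1)}
    # Second kind: randomize between i1 and i2 with weights alpha and 1 - alpha.
    _, i1, i2, alpha = kind
    return {i1: alpha, i2: 1 - alpha}


def expected_loss(dist, loss, outcome):
    """Expected loss of a randomized action choice under a fixed outcome."""
    return sum(prob * loss[a][outcome] for a, prob in dist.items())


# Hypothetical two-outcome loss vectors, indexed by action.
loss = {1: (0, 4), 2: (1, 2), 3: (2, 0), 4: (3, 1)}
rule = {2: ("mixture", 1, 3, Fraction(1, 2)), 4: ("dominated", 3)}

for action in loss:
    for outcome in (0, 1):
        new = expected_loss(substitute(action, rule), loss, outcome)
        print(action, outcome, loss[action][outcome], new)
```

For every action and outcome the substituted expected loss is at most the original loss, with equality in the mixture case, which is the per-step inequality the proof relies on.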

Proof of Theorem 13 for adversarial nature

For the proof, we start with a lemma, which ensures the existence of a pair $i_1, i_2$ of actions and an
outcome distribution $p$ with $M$ atoms such that both $i_1$ and $i_2$ are optimal under $p$.



Lemma 22. Let G = (L, H) be any finite non-trivial game with $N$ actions and $M \ge 2$ outcomes. Then there
exists $p \in \Delta_M$ satisfying both of the following properties:

(a) All coordinates of $p$ are positive.

(b) There exist actions $i_1, i_2 \in \underline{N}$ such that $\ell_{i_1} \ne \ell_{i_2}$ and for all $i \in \underline{N}$,

$$
\langle \ell_{i_1}, p\rangle = \langle \ell_{i_2}, p\rangle \le \langle \ell_i, p\rangle .
$$



Proof of Lemma 22. Note that distributions $p$ with positive coordinates form the interior of $\Delta_M$ ($\operatorname{int} \Delta_M$).
For any action $i \in \underline{N}$, as in the proof of Lemma 14, consider the compact convex cell $C_i$ in $\Delta_M$, whose
union is $\Delta_M$ (see (4)). Let $p_1$ be any point in the interior of $\Delta_M$. By (4), there is a cell $C_{i_1}$ containing $p_1$.
If $C_{i_1} = \Delta_M$ held then action $i_1$ would satisfy Lemma 3 c), thus also d), and the game would be trivial. So
there must be a point, say $p_2$, in $\Delta_M \setminus C_{i_1}$. The intersection of the closed segment $\overline{p_1 p_2}$ and $C_{i_1}$ is closed
and convex, thus it is a closed subsegment $\overline{p_1 p}$ for some $p \in C_{i_1}$ ($p \ne p_2$). $p_1 \in \operatorname{int} \Delta_M$ and the convexity of
$\Delta_M$ imply $p \in \operatorname{int} \Delta_M$. Since the open segment $\overline{p p_2}$ has to be covered by $\bigcup_{i': C_{i'} \ne C_{i_1}} C_{i'}$, which is a closed set,
$p \in \bigcup_{i': C_{i'} \ne C_{i_1}} C_{i'}$ must also hold, that is, $p \in C_{i_2}$ for some $C_{i_2} \ne C_{i_1}$ (requiring $\ell_{i_2} \ne \ell_{i_1}$). Hence $p$ satisfies
both (a) and (b). □

Proof of Theorem 13. When $M = 1$, G is always trivial, thus we assume that $M \ge 2$. Without loss of
generality we may assume that all the actions are all-revealing.

Let $p \in \Delta_M$ be a distribution of the outcomes that satisfies conditions (a) and (b) of Lemma 22. By
renaming actions we can assume without loss of generality that $\ell_1 \ne \ell_2$ and actions 1 and 2 are optimal
under $p$, that is,

$$
\langle \ell_1, p\rangle = \langle \ell_2, p\rangle \le \langle \ell_i, p\rangle \qquad\qquad \text{(A.5)}
$$

for any $i \in \underline{N}$.

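Fix any learning algorithm A. We use randomization, replacing the outcomes by a sequence $J_1, J_2, \dots, J_T$
of random variables i.i.d. according to $p$, and independent of the internal randomization of A. Clearly, as in
the proof of Lemma 17, the worst-case expected regret of A is at least its average regret:

$$
R_T(A,G) \ge \mathbb{E}[R_T(A,G)] = \mathbb{E}\Big[\sum_{t=1}^{T} \ell_{I_t,J_t} - \min_{i\in\underline{N}} \sum_{t=1}^{T} \ell_{i,J_t}\Big] . \qquad\qquad \text{(A.6)}
$$

Here, in the last two expressions, the expectation is with respect to both the internal randomization of A and
the random choice of $J_1, J_2, \dots, J_T$. Now, since $J_t$ is independent of $I_t$, we see that $\mathbb{E}[\ell_{I_t,J_t} \mid I_t] = \langle \ell_{I_t}, p\rangle$.
By (A.5), we have $\langle \ell_{I_t}, p\rangle \ge \langle \ell_1, p\rangle = \langle \ell_2, p\rangle$. Therefore (upper bounding also the minimum),

$$
\begin{aligned}
\mathbb{E}\Big[\sum_{t=1}^{T} \ell_{I_t,J_t} - \min_{i\in\underline{N}} \sum_{t=1}^{T} \ell_{i,J_t}\Big]
&= \mathbb{E}\Big[\sum_{t=1}^{T} \langle \ell_{I_t}, p\rangle - \min_{i\in\underline{N}} \sum_{t=1}^{T} \ell_{i,J_t}\Big] \\
&\ge \mathbb{E}\Big[\sum_{t=1}^{T} \langle \ell_1, p\rangle - \min_{i=1,2} \sum_{t=1}^{T} \ell_{i,J_t}\Big] \\
&= \mathbb{E}\Big[\max_{i=1,2} \sum_{t=1}^{T} \big(\langle \ell_i, p\rangle - \ell_{i,J_t}\big)\Big] . \qquad\qquad \text{(A.7)}
\end{aligned}
$$

Using the identity $\max\{a, b\} = \tfrac{1}{2}(a + b + |a - b|)$, the expression inside the last expectation is

$$
\frac{1}{2} \sum_{t=1}^{T} \big(2\langle \ell_1, p\rangle - \ell_{1,J_t} - \ell_{2,J_t}\big) + \frac{1}{2}\,\Big|\sum_{t=1}^{T} \big(\ell_{2,J_t} - \ell_{1,J_t}\big)\Big| ,
$$

where (A.5) was used in the first term. The expectation of the first term vanishes since $\mathbb{E}[\ell_{i,J_t}] = \langle \ell_i, p\rangle$. Let
$X_t = \ell_{2,J_t} - \ell_{1,J_t}$. We see that $X_1, X_2, \dots, X_T$ are i.i.d. random variables with mean $\mathbb{E}[X_t] = 0$. Therefore,

$$
\mathbb{E}\Big[\max_{i=1,2} \sum_{t=1}^{T} \big(\langle \ell_i, p\rangle - \ell_{i,J_t}\big)\Big] = \frac{1}{2}\,\mathbb{E}\Big|\sum_{t=1}^{T} X_t\Big| \ge c\sqrt{T} , \qquad\qquad \text{(A.8)}
$$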



where the last inequality follows from Theorem 23 stated below and the constant $c$ depends only on $\ell_1$, $\ell_2$,
and $p$. For the theorem to yield $c > 0$, it is important to note that the distribution of $X_t$ has finite support and
with positive probability $X_t \ne 0$, since $\ell_1 \ne \ell_2$ and all coordinates of $p$ are positive. Hence, both $\mathbb{E}[X_t^2]$ and
$\mathbb{E}[X_t^4]$ are finite and positive.

Now, putting together (A.6), (A.7), and (A.8) gives the desired lower bound $R_T(A, G) \ge c\sqrt{T}$. Since $c$
depends only on L, also $R_T(G) \ge c\sqrt{T}$. □

The following theorem is a variant of Khinchine's inequality (see e.g. [20, Lemma A.9]) for asymmetric 
random variables. The idea of the proof is the same as there and originally comes from Littlewood [25]. 

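Theorem 23 (Khinchine's inequality for asymmetric random variables). Let $X_1, X_2, \dots, X_T$ be i.i.d. random
variables with mean $\mathbb{E}[X_t] = 0$, finite variance $\mathbb{E}[X_t^2] = \mathrm{Var}(X_t) = \sigma^2$, and finite fourth moment $\mathbb{E}[X_t^4] = \mu_4$.
Then,

$$
\mathbb{E}\Big|\sum_{t=1}^{T} X_t\Big| \ge \frac{\sigma^3}{\sqrt{3\mu_4}}\,\sqrt{T} .
$$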
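
The inequality can be sanity-checked on a simple asymmetric distribution. The sketch below is an illustration only, assuming a two-point variable that takes the value `a` with probability `q` and `-b` otherwise, with `q * a == (1 - q) * b` so that the mean is zero; the function names and the chosen parameters are hypothetical. The expectation is computed exactly (up to floating point) from the binomial distribution rather than by simulation.

```python
import math


def expected_abs_sum(horizon, a, b, q):
    """E|X_1 + ... + X_T| for i.i.d. X_t with P(X_t = a) = q and P(X_t = -b) = 1 - q."""
    total = 0.0
    for k in range(horizon + 1):
        # k of the T variables equal a, the remaining T - k equal -b.
        prob = math.comb(horizon, k) * q**k * (1 - q) ** (horizon - k)
        total += prob * abs(k * a - (horizon - k) * b)
    return total


def khinchine_lower_bound(horizon, a, b, q):
    """sigma^3 / sqrt(3 * mu_4) * sqrt(T) for the same two-point distribution."""
    sigma2 = q * a**2 + (1 - q) * b**2   # variance (the mean is zero)
    mu4 = q * a**4 + (1 - q) * b**4      # fourth moment
    return sigma2**1.5 / math.sqrt(3 * mu4) * math.sqrt(horizon)


# Hypothetical asymmetric example: X = 2 w.p. 1/3 and X = -1 w.p. 2/3 (mean zero).
a, b, q = 2.0, 1.0, 1.0 / 3.0
for horizon in (1, 10, 100):
    print(horizon, expected_abs_sum(horizon, a, b, q), khinchine_lower_bound(horizon, a, b, q))
```

For this distribution $\sigma^2 = 2$ and $\mu_4 = 6$, so the bound is $\tfrac{2}{3}\sqrt{T}$, and the exact expectation stays above it for every horizon tried.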



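Proof. [20, Lemma A.4] implies that for any random variable $Z$ with finite fourth moment

$$
\mathbb{E}|Z| \ge \frac{\big(\mathbb{E}[Z^2]\big)^{3/2}}{\big(\mathbb{E}[Z^4]\big)^{1/2}} .
$$

Applying this inequality to $Z = \sum_{t=1}^{T} X_t$ we get

$$
\mathbb{E}\Big|\sum_{t=1}^{T} X_t\Big| \ge \frac{(T\sigma^2)^{3/2}}{T\sqrt{3\mu_4}} = \frac{\sigma^3}{\sqrt{3\mu_4}}\,\sqrt{T} ,
$$

that follows from

$$
\mathbb{E}[Z^2] = \mathbb{E}\Big[\Big(\sum_{t=1}^{T} X_t\Big)^{2}\Big] = \sum_{t=1}^{T} \mathbb{E}[X_t^2] = T\sigma^2
$$

and

$$
\mathbb{E}[Z^4] = \mathbb{E}\Big[\Big(\sum_{t=1}^{T} X_t\Big)^{4}\Big] = \sum_{t=1}^{T} \mathbb{E}[X_t^4] + 6 \sum_{1\le s<t\le T} \mathbb{E}[X_s^2]\,\mathbb{E}[X_t^2] = T\mu_4 + 3T(T-1)\sigma^4 \le 3T^2\mu_4 ,
$$

where we have used the independence of the $X_t$'s and $\mathbb{E}[X_t] = 0$, which ensure that mixed terms $\mathbb{E}[X_t X_s]$,
$\mathbb{E}[X_t X_s^3]$, etc. vanish. We also used that $\sigma^4 = \mathbb{E}[X_t^2]^2 \le \mathbb{E}[X_t^4] = \mu_4$. □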